Could you please tell me why the value of timediff printed by the following program is usually 4 microseconds (printed between roughly 90 and 1000 times, depending on the run), but occasionally 70 or more microseconds (in 2 to 10 cases per run):
#include <iostream>
#include <sys/time.h>
using namespace std;

#define MAXQ 1000000
#define THRDS 3

double GetMicroSecond()
{
    timeval tv;
    gettimeofday(&tv, NULL);
    return (double) (((double)tv.tv_sec * 1000000) + (double)tv.tv_usec);
}

int main()
{
    double timew, timer, timediff;
    bool flagarray[MAXQ];
    int x = 0, y = 0;

    for (int i = 0; i < MAXQ; ++i)
        flagarray[i] = false;

    while (y < MAXQ)
    {
        x++;
        if (x % 1000 == 0)
        {
            timew = GetMicroSecond();
            flagarray[y++] = true;
            timer = GetMicroSecond();
            timediff = timer - timew;
            if (timediff > THRDS) cout << timer - timew << endl;
        }
    }
}
Compiled using: g++ testlatency.cpp -o testlatency
Note: In my system there are 12 cores. The performance is checked with only this program running in the system.
timew = GetMicroSecond();
flagarray[y++]=true;
timer = GetMicroSecond();
The statement flagarray[y++]=true; will take much less than a microsecond to execute on a modern computer if flagarray[y++] happens to be in the level 1 cache. The statement will take longer to execute if that location is in level 2 cache but not in level 1 cache, much longer if it is in level 3 cache but not in level 1 or level 2 cache, and much, much longer yet if it isn't in any of the caches.
Another thing that can make timer-timew exceed three microseconds is when your program yields to the OS. Cache misses can result in a yield. So can system calls. The function gettimeofday is a system call. As a general rule, you should expect any system call to yield.
Note: In my system there are 12 cores. The performance is checked with only this program running in the system.
This is not true. There are always many other programs, and many, many other threads, running on your 12-core computer. These include the operating system itself (which comprises many threads in and of itself), plus lots and lots of little daemons. Whenever your program yields, the OS can decide to temporarily suspend your program so that one of the myriad other threads that are asking for use of the CPU can run instead.
One of those daemons is the Network Time Protocol daemon (ntpd). This does all kinds of funky little things to your system clock to keep it closely in sync with atomic clocks. With a tiny little instruction such as flagarray[y++]=true being the only thing between successive calls to gettimeofday, you might even see time occasionally go backwards.
When testing for timing, it's a good idea to do the timing at a coarse level. Don't time an individual statement that doesn't involve any function calls. It's much better to time a loop than it is to time individual executions of the loop body. Even then, you should expect some variability in timing because of cache misses and because the OS temporarily suspends execution of your program.
Modern Unix-based systems have better timers than gettimeofday (e.g., clock_gettime with CLOCK_MONOTONIC) that are not subject to changes made by the Network Time Protocol daemon. You should use one of these rather than gettimeofday.
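As a minimal sketch of both suggestions (assuming Linux with g++; on older glibc you may need to link with -lrt), the following times the whole fill loop with clock_gettime and CLOCK_MONOTONIC, which does not jump when ntpd steps the wall clock, instead of timing a single store:

#include <time.h>      // clock_gettime, CLOCK_MONOTONIC
#include <iostream>

#define MAXQ 1000000

int main()
{
    static bool flagarray[MAXQ];
    timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);   // monotonic: immune to wall-clock steps
    for (int i = 0; i < MAXQ; ++i)
        flagarray[i] = true;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    std::cout << "Whole loop: " << usec << " microseconds" << std::endl;
}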
Generally, there are many threads sharing a small number of cores. Unless you take steps to ensure that your thread has uninterrupted use of a core, you can't guarantee that the OS won't decide to preempt your thread between the two GetMicroSecond() calls, and let some other thread use the core for a bit.
Even if your code runs uninterrupted, the line you're trying to time:
flagarray[y++]=true;
likely takes much less time to execute than the measurement code itself.
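As a rough sketch of how large the measurement overhead itself can be, timing nothing at all - two back-to-back gettimeofday calls - already reports a value on the order of what the question is trying to measure:

#include <sys/time.h>
#include <iostream>

int main()
{
    timeval a, b;
    gettimeofday(&a, NULL);
    gettimeofday(&b, NULL);   // the only thing being "measured" is the measurement
    long usec = (b.tv_sec - a.tv_sec) * 1000000L + (b.tv_usec - a.tv_usec);
    std::cout << "Back-to-back gettimeofday: " << usec << " microseconds" << std::endl;
}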
There are many things happening inside a modern OS at the same time as your program executes. Some of them may "steal" CPU from your program, as stated in NPE's answer. A few more examples of what can influence timing:
interrupts from devices (the timer, HDD and network interfaces, to name a few);
access to RAM (caching).
None of these are easily predictable.
You can expect consistency if you run your code on a microcontroller, or maybe on a real-time OS.
There are a lot of variables that might explain different time values seen. I would focus more on
Cache miss/fill
Scheduler Events
Interrupts
bool flagarray[MAXQ];
Since you defined MAXQ to 1000000, let's assume that flagarray takes up 1MB of space.
You can compute how many cache misses can occur, based on your L1/L2 D-cache sizes. Then you can correlate that with how many iterations it takes to fill all of L1 and start missing, and likewise for L2 (a rough calculation is sketched below). The OS may also deschedule your process and reschedule it, but that is hopefully less likely given the number of cores you have. The same goes for interrupts; an idle system is never completely idle. You may choose to pin your process to a core number, say N, by doing
taskset 0x<MASK> ./exe and control its execution.
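As a rough illustration (the cache-line size and L1 size below are typical values, not taken from the question): with 64-byte lines, the 1 MB flagarray spans 1,048,576 / 64 = 16,384 cache lines, so the sequential writes touch a new line every 64 iterations; and since the array is about 32 times larger than a 32 KB L1 D-cache, a steady stream of L1 misses is expected no matter what the scheduler does.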
If you are really curious, I would suggest that you use the "perf" tool available on most Linux distros.
You may do
perf stat -e L1-dcache-load-misses
or
perf stat -e LLC-load-misses
Once you have these numbers and the number of iterations, you can start building a picture of the activity that causes the observed lag. You may also monitor OS scheduler events using "perf stat".
Related
Right now I basically have a program that uses clock to test the amount of time my program takes to do certain operations and usually it is accurate to a couple milliseconds. My question is this: If the CPU is under heavy load will I still get the same results?
Does clock only count when the CPU is working on my process?
Let's assume: a multi-core CPU, but a process that does not take advantage of multithreading.
The behaviour of clock depends on the OS. On Windows, from a long-distant decision, clock gives the elapsed (wall-clock) time; on most other OSs (certainly Linux, MacOS and other Unix-related OSs) it gives the CPU time used by the process.
Depending on what you actually want to achieve, elapsed time or CPU time may be what you want to measure.
In a system where there are other processes running, the difference between elapsed time and CPU usage may be huge (and of course, if your CPU is NOT busy running your application, e.g. it is waiting for network packets to go down the wire or for file data from the hard disk, then the elapsed time is "available" for other applications).
There are also a huge number of error factors/interference factors when there are other processes running in the same system:
If we assume that your OS supports clock as a measure of CPU-time, the precision here is not always that great - for example, it may well be accounted in terms of CPU-timer ticks, and your process may not run for the "full tick" if it's doing I/O for example.
Other processes may use "your" CPU for parts of the interrupt handling - before the OS has switched to "account for this as interrupt time" - when dealing with packets over the network or hard-disk I/O; typically this is not a huge amount, but in a very busy system it can be several percent of the total time. And if other processes run on "your" CPU, the time to reload the cache(s) with "your" process' data after the other process has loaded its data will be accounted to "your" time. This sort of "interference" may very well affect your measurements - how much depends very much on "what else" is going on in the system.
If your process shares data [via shared memory] with another process, there will also be (again, typically a minute amount, but in extreme cases, it can be significant) some time spent on dealing with "cache-snoop requests" between your process and the other process, when your process doesn't get to execute.
If the OS is switching tasks, "half" of the time spent switching to/from your task will be accounted to your process, and half to the other process being switched in/out. Again, this is usually a tiny amount, but if you have a very busy system with lots of process switches, it can add up.
Some processor features, e.g. Intel's HyperThreading, also share resources within your actual core, so only SOME of the time on that core is spent in your process, and your process' cache content is now shared with some other process' data and instructions - meaning your process MAY get "evicted" from the cache by the other thread running on the same CPU core.
Likewise, multicore CPUs often have a shared L3 cache that gets affected by other processes running on the other cores of the CPU.
File-caching and other "system caches" will also be affected by other processes - so if your process is reading some file(s), and other processes also access file(s), the cache-content will be "less yours" than if the system wasn't so busy.
For accurate measurements of how much your process uses of the system resources, you need processor performance counters (and a reproducible test case, because you probably need to run the same setup several times to ensure that you get the "right" combination of performance counters). Of course, most of these counters are ALSO system-wide, and some kinds of processing, for example interrupts and other random interference, will affect the measurement, so the most accurate results will be obtained if you DON'T have many other (busy) processes running in the system.
Of course, in MANY cases, just measuring the overall time of your application is perfectly adequate. Again, as long as you have a reproducible test case that gives the same (or at least similar) timing each time it's run in a particular scenario.
Each application is different, each system is different. Performance measurement is a HUGE subject, and it's very hard to cover EVERYTHING - and of course, we're not here to answer extremely specific questions about "how do I get my PI-with-a-million-decimals to run faster when there are other processes running in the same system" or whatever it may be.
In addition to agreeing with the responses indicating that timings depend on many factors, I would like to bring to your attention the std::chrono library available since C++11:
#include <chrono>
#include <iostream>

int main() {
    auto beg = std::chrono::high_resolution_clock::now();
    std::cout << "*** Displaying Some Stuff ***" << std::endl;
    auto end = std::chrono::high_resolution_clock::now();
    auto dur = std::chrono::duration_cast<std::chrono::microseconds>(end - beg);
    std::cout << "Elapsed: " << dur.count() << " microseconds" << std::endl;
}
As per the standard, this program uses the highest-resolution clock provided by your system, and the duration is reported here with microsecond resolution (other resolutions are available; see the docs).
Sample run:
$ g++ example.cpp -std=c++14 -Wall -Wextra -O3
$ ./a.out
*** Displaying Some Stuff ***
Elapsed: 29 microseconds
While it is much more verbose than relying on the C-style std::clock(), I feel it gives you much more expressiveness, and you can hide the verbosity behind a nice interface (for example, see my answer to a previous post where I use std::chrono to build a function timer).
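A minimal sketch of such a wrapper (the name measure_us is made up for illustration; any callable and its arguments are forwarded, and the elapsed microseconds are returned) could look like this:

#include <chrono>
#include <utility>

// Times a single call to f(args...) and returns the elapsed time in microseconds.
template <typename F, typename... Args>
long long measure_us(F&& f, Args&&... args)
{
    auto beg = std::chrono::steady_clock::now();
    std::forward<F>(f)(std::forward<Args>(args)...);
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - beg).count();
}

Usage would then be something like auto us = measure_us([]{ /* work to time */ });, keeping the chrono boilerplate in one place.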
There are shared components in the CPU, like the last-level cache and the execution units (shared between hardware threads within one core), so under heavy load you will get jitter: even if your application executed exactly the same number of instructions, each instruction may take more cycles (waiting for memory because data was evicted from the cache, or waiting for an available execution unit), and more cycles means more time to execute (assuming that Turbo Boost won't compensate).
If you are looking for a precise instrument, look at hardware counters.
When looking at timing metrics for CPU-intensive tasks, it is also important to consider factors like the number of cores available on the physical CPU, hyper-threading and other BIOS settings such as Turbo Boost on Intel CPUs, and the threading techniques used in the code.
Parallelization tools like OpenMP provide built-in functions for measuring computation and wall time, such as omp_get_wtime(), which are often more accurate than clock() in programs that make use of this type of parallelization.
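For what it's worth, a small sketch of that (compile with -fopenmp; the loop body is just placeholder work):

#include <omp.h>
#include <cstdio>

int main()
{
    const int n = 100000000;
    double sum = 0.0;

    double t0 = omp_get_wtime();               // wall-clock time in seconds
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += i * 0.5;                        // placeholder work
    double t1 = omp_get_wtime();

    std::printf("sum = %f, elapsed = %f s\n", sum, t1 - t0);
}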
I understand that a preemptive multitasking OS can interrupt a process at any "code position".
Given the following code:
int main() {
    while( true ) {
        doSthImportant(); // needs to be executed at least each 20 msec

        // start of critical section
        int start_usec = getTime_usec();
        doSthElse();
        int timeDiff_usec = getTime_usec() - start_usec;
        // end of critical section

        evalUsedTime( timeDiff_usec );
        sleep_msec( 10 );
    }
}
I would expect this code to usually produce proper results for timeDiff_usec, especially in case that doSthElse() and getTime_usec() don't take much time so they get interrupted rarely by the OS scheduler.
But the program would get interrupted from time to time somewhere in the "critical section". The context switch will do what it is supposed to do, and still in such a case the program would produce wrong results for the timeDiff_usec.
This is the only example I have in mind right now but I'm sure there would be other scenarios where multitasking might get a program(mer) into trouble (as time is not the only state that might be changed at re-entry).
Is there a way to ensure that measuring the time for a certain action works fine?
Which other common issues are critical with multitasking and need to be considered? (I'm not thinking of thread safety - but there might be common issues).
Edit:
I changed the sample code to make it more precise.
I want to check the time being spent to make sure that doSthElse() doesn't take like 50 msec or so, and if it does I would look for a better solution.
Is there a way to ensure that measuring the time for a certain action works fine?
That depends on your operating system and your privilege level. On some systems, for some privilege levels, you can set a process or thread to a priority that prevents it from being preempted by anything at lower priority. For example, on Linux, you might use sched_setscheduler to give a thread real-time priority; a minimal sketch follows below. (If you're really serious, you can also set the thread's CPU affinity and the interrupts' SMP affinities to prevent any interrupts from being handled on the CPU that's running your thread.)
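A minimal Linux sketch (requires appropriate privileges such as root or CAP_SYS_NICE; the priority value 50 and the CPU number 2 are arbitrary choices for illustration):

#include <sched.h>
#include <cstdio>

int main()
{
    // Give the calling thread/process a real-time FIFO priority.
    sched_param sp;
    sp.sched_priority = 50;                        // valid range is 1..99 on Linux
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        std::perror("sched_setscheduler");

    // Optionally pin it to a single CPU so it is not migrated.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        std::perror("sched_setaffinity");

    // ... timed work goes here ...
}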
Your system may also provide time tracking that accounts for time spent preempted. For example, POSIX defines the getrusage function, which returns a struct containing ru_utime (the amount of time spent in “user mode” by the process) and ru_stime (the amount of time spent in “kernel mode” by the process). These should sum to the total time the CPU spent on the process, excluding intervals during which the process was suspended. Note that if the kernel needs to, for example, spend time paging on behalf of your process, it's not defined how much (if any) of that time is charged to your process.
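For instance, a small sketch of reading those fields:

#include <sys/resource.h>
#include <cstdio>

int main()
{
    rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0)
    {
        double user_s = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
        double sys_s  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
        std::printf("user: %f s, system: %f s\n", user_s, sys_s);
    }
}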
Anyway, the common way to measure time spent on some critical action is to time it (essentially the way your question presents) repeatedly, on an otherwise idle system, throw out outlier measurements, and take the mean (after eliminating outliers), or take the median or 95th percentile of the measurements, depending on why you need the measurement.
Which other common issues are critical with multitasking and need to be considered? (I'm not thinking of thread safety - but there might be common issues).
Too broad. There are whole books written about this subject.
I encountered a problem where my game loop stuttered approximately once a second (at variable intervals). A single frame then takes over 60 ms, whereas all others require less than 1 ms.
After simplifying a lot I ended with the following program which reproduces the bug. It only measures the frame time and reports it.
#include <iostream>
#include "windows.h"

int main()
{
    unsigned long long frequency, tic, toc;
    QueryPerformanceFrequency((LARGE_INTEGER*)&frequency);
    QueryPerformanceCounter((LARGE_INTEGER*)&tic);
    double deltaTime = 0.0;

    while( true )
    {
        //if(deltaTime > 0.01)
            std::cerr << deltaTime << std::endl;

        QueryPerformanceCounter((LARGE_INTEGER*)&toc);
        deltaTime = (toc - tic) / double(frequency);
        tic = toc;
        if(deltaTime < 0.01) deltaTime = 0.01;
    }
}
Again, one frame in many is much slower than the others. Adding the if makes the error vanish (cerr is then never called). My original problem didn't contain any cerr/cout. However, I consider this a reproduction of the same error.
cerr is flushed in every iteration, so accumulated buffering is not what creates the single slow frames. I know from a profiler (Very Sleepy) that the stream internally uses a lock/critical section, but this shouldn't change anything because the program is single-threaded.
What causes single iterations to stall that much?
Edit: I did some more tests:
Adding std::this_thread::sleep_for( std::chrono::milliseconds(7) ); and therefore reducing the process CPU utilization does not change anything.
With printf("%f\n", deltaTime); the problem vanishes (maybe because, in contrast to the stream, it doesn't use a mutex or memory allocation).
The design of Windows does not guarantee an upper limit on any execution time, since it dynamically allocates runtime resources to all programs using some logic - for example, the scheduler will allocate resources to a process with high priority, and starve lower-priority processes in some circumstances. Programs are statistically more likely to - eventually - be affected by such things if they run tight loops and consume a lot of CPU resources, because - again, eventually - the scheduler will temporarily boost the priority of programs that are being starved and/or reduce the priority of programs that are starving others (in your case, by running a tight loop).
Making the output to std::cerr conditional doesn't change the fact of this happening - it just changes the likelihood that it will happen in a specified time interval, because it changes how the program uses system resources in the loop, and therefore changes how it interacts with the system scheduler, policies, etc.
This sort of thing affects programs running in all non-realtime operating systems, although the precise impact depends on how each OS is implemented (e.g. scheduling strategies, other policies controlling access by programs to resources, etc). There is always a non-zero probability (even if it is small) of such stalls occurring.
If you want absolute guarantees of no stalls on such things, you will need a realtime operating system. These systems are designed to do things more predictably in a timing sense, but that comes with trade-offs, since it also requires your programs to be designed with the knowledge that they MUST complete execution of specified functions within specified time intervals. Realtime operating systems use different strategies, but their enforcement of timing constraints can cause the program to malfunction if the program is not designed with such things in mind.
I'm not sure about it, but it could be that the system is interrupting your main thread to let others run, and since it takes some time (I remember on my Windows XP pc the quantum was 10ms), it will stall a frame.
This is very visible because it is a single-threaded application; if you use several threads, they are usually dispatched onto several cores of the processor (if available), and the stalls will still be there but less noticeable (if you implemented your application logic right).
Edit: here you can have more information about the Windows and Linux schedulers. Basically, Windows uses quanta (varying from a handful of milliseconds up to 120 ms on Windows Server).
Edit 2: you can see a more detailed explanation on the windows scheduler here.
In school we were introduced to C++11 threads. The teacher gave us a simple assessment to complete which was to make a basic web crawler using 20 threads. To me threading is pretty new, although I do understand the basics.
I would like to mention that I am not looking for someone to complete my assessment as it is already done. I only want to understand the reason why using 6 threads is always faster than using 20.
Please see code sample below.
main.cpp:
do
{
    for (size_t i = 0; i < THREAD_COUNT; i++)
    {
        threads[i] = std::thread(SweepUrlList);
    }

    for (size_t i = 0; i < THREAD_COUNT; i++)
    {
        threads[i].join();
    }

    std::cout << std::endl;

    WriteToConsole();
    listUrl = listNewUrl;
    listNewUrl.clear();
} while (listUrl.size() != 0);
Basically, this assigns to each worker thread the job to complete, which is the method SweepUrlList shown below, and then joins all threads.
while (1)
{
    mutextGetNextUrl.lock();
    std::set<std::string>::iterator it = listUrl.begin();
    if (it == listUrl.end())
    {
        mutextGetNextUrl.unlock();
        break;
    }
    std::string url(*it);
    listUrl.erase(*it);
    mutextGetNextUrl.unlock();

    ExtractEmail(url, listEmail);

    std::cout << ".";
}
So each worker thread loops until listUrl is empty. ExtractEmail is a method that downloads the webpage (using curl) and parses it to extract emails from mailto links.
The only blocking call in ExtractEmail can be found below:
if(email.length() != 0)
{
    mutextInsertNewEmail.lock();
    ListEmail.insert(email);
    mutextInsertNewEmail.unlock();
}
All answers are welcome and if possible links to any documentation you found to answer this question.
This is a fairly universal problem with threading, and at its core:
What you are demonstrating is thread scheduling. The operating system works with the various threads and schedules work onto processors that are currently free.
Assuming you have 4 cores with hyper-threading, you have 8 logical processors that can carry the load - but also the load of other applications (the operating system, the C++ debugger, and your application, to start).
In theory, you would probably be OK on performance up until about 8 intensive threads. Once you exceed the number of threads your processor can effectively use, threads begin to compete against each other for resources. This can be seen (especially with intensive applications and tight loops) as poor performance.
Finally, this is a simplified answer, but I suspect it is what you are seeing.
The simple answer is choke points. Something that you are doing is causing a choke point. When this occurs, there is a slowdown. It could be the number of active connections you are making to something, or merely the extra overhead of the number and memory size of the threads (see the answer below about cores being one of these choke points).
You will need to set up a series of monitors to investigate where your choke point is, and what needs to change in order to achieve scale. Many systems across every industry face this problem every day. Opening up the throttle at one end does not produce the same increase in output at the other end. In some cases it can decrease the output at the other end.
Take, for example, individuals leaving a hall. The goal is to get 100 people out of the building as quickly as possible. If single file produces a rate of 1 person every second, it takes 100 seconds to clear the building. We may be able to halve that time by sending them out 2 abreast, so 50 seconds to clear the building. What if we then sent them out 8 abreast? The door is only 2 m wide, so with 8 abreast being equivalent to 4 m, only 50% of the first row would make it through. The other 4 would then cause a blockage for the next row, and so on. Depending on the rate, this could cause temporary blockages and increase the time tenfold.
Threads are an operating system construct. Basically, each thread's state (which is essentially all the CPU's registers plus the virtual memory mapping [which is part of the process construct]) is saved by the operating system. Once the OS gives that specific thread "execution time", it restores this state and lets it run. Once this time is finished, it has to save this state again. The process of saving one thread's state and restoring another's is called context switching, and it takes a significant amount of time (usually a few hundred to a few thousand CPU cycles).
There are also additional penalties to context switching. Some of the processor's caches (like the virtual memory translation cache, called the TLB) have to be flushed, pipelined instructions have to be discarded, and more. Generally, you want to minimize context switching as much as possible.
If your CPU has 4 cores, then 4 threads can run simultaneously. If you try to run 20 threads on a 4-core system, then the OS has to multiplex time between those threads so that they appear to run in parallel. E.g., threads 1-4 will run for 50 milliseconds, then threads 5-8 will run for 50 milliseconds, and so on.
Therefore, if all of your threads are running CPU-intensive operations, it is generally most efficient to make your program use the same number of threads as cores (sometimes called 'processors' in Windows). If you have more threads than cores, then context switching must happen, and that is overhead which can be minimized.
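A common way to pick a sensible default, sketched here, is to ask the standard library how many hardware threads are available and fall back to a small constant if that is unknown:

#include <iostream>
#include <thread>
#include <vector>

int main()
{
    unsigned n = std::thread::hardware_concurrency();   // may return 0 if unknown
    if (n == 0) n = 4;                                   // arbitrary fallback
    std::cout << "Using " << n << " worker threads" << std::endl;

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([]{ /* CPU-bound work would go here */ });
    for (auto& t : workers)
        t.join();
}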
In general, more threads is not better. More threading provides value in two ways: higher parallelism and less blocking. More threading hurts through higher memory use, more context switching and more resource contention.
The value of more threads for higher parallelism is generally maximized between 1-2x the number of actual cores that you have available. If your threads are already CPU bound the maximum value is generally 1x number of cores.
The value of less blocking is much harder to quantify and depends on the type of work you are performing. If you are IO bound and your threads are primarily waiting for IO to be ready then a larger number of threads could be beneficial.
However, if you have shared state between threads, or you are doing some form of message passing between threads, then you will run into synchronization and contention issues. As the number of threads increases, these types of overhead, as well as context switches, increasingly dominate the time spent doing your task.
Amdahl's law is a useful measure to determine if higher parallelism will actually improve the total runtime of your job.
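In its simplest form: if a fraction p of the work can be parallelized across n threads, the speedup is bounded by 1 / ((1 - p) + p / n). With an illustrative p = 0.9, 6 threads give at most 1 / (0.1 + 0.9/6) = 4.0x, while 20 threads give at most 1 / (0.1 + 0.9/20) ≈ 6.9x - and that bound ignores the extra contention and context switching the additional threads introduce, which in practice can push real throughput back down.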
You also must be careful that your increased parallelism doesn't exceed some other resource like total memory or disk or network throughput. Once you have saturated the current bottleneck, you will not see improved performance by increasing the number of threads.
Before doing any performance tuning, it is important to understand what the dominant resource bottleneck is. There are lots of tools for doing system-wide resource monitoring. On Linux, one very useful tool is dstat. On Windows, you can use the Task Manager to monitor many of these resources.
I use pthread lib 2.8 and the OS kernel is Linux 2.6.37 on ARM. In my program, thread A uses the pthread interfaces to set its scheduling priority to the halfway point between sched_get_priority_min(policy) and sched_get_priority_max(policy).
In the thread function loop:
{
    // do my work
    pthread_cond_timedwait(..., ..., 15 ms)
}
I find this thread consumes about 3% CPU. If I change the timeout to 30 ms, it reduces to 1.3%. However, I can not increase the timeout. Is there a way to reduce the CPU consumption without reducing the timeout? It seems the cost is due to thread switching.
Using this sort of construct will cause approximately 67 task switches per second, and most likely each one is a switch to a different process, which means a complete context switch including page tables. It's been a while since I looked at what that involves on ARM, but I'm sure it's not a "lightweight" operation. If we count backwards, the roughly 1.7% difference spread over 67 switches per second works out to about 210k clock cycles per task switch (this implicitly assumes a core clock in the region of 800 MHz). That seems quite a lot, but I'm not sure how much work is involved in scrubbing the TLBs, caches and such.
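For reference, a hedged sketch of the waiting construct in the question written out in full (variable names are illustrative; compile and link with -pthread; pthread_cond_timedwait takes an absolute deadline, and a 15 ms timeout means roughly 1000 / 15 ≈ 67 wakeups per second):

#include <pthread.h>
#include <time.h>

static pthread_mutex_t mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

void* worker(void*)
{
    pthread_mutex_lock(&mtx);
    for (;;)
    {
        // ... do my work ...

        timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);    // default clock for a cond var
        deadline.tv_nsec += 15 * 1000 * 1000;        // 15 ms from now
        if (deadline.tv_nsec >= 1000000000L)
        {
            deadline.tv_nsec -= 1000000000L;
            deadline.tv_sec  += 1;
        }
        pthread_cond_timedwait(&cond, &mtx, &deadline);   // one wakeup per period
    }
    pthread_mutex_unlock(&mtx);   // not reached; shown for symmetry
    return 0;
}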