Worse performance with multiple threads in C++ [duplicate]

There is a really interesting note here: http://en.cppreference.com/w/cpp/chrono/c/clock
"Only the difference between two values returned by different calls to std::clock is meaningful, as the beginning of the std::clock era does not have to coincide with the start of the program. std::clock time may advance faster or slower than the wall clock, depending on the execution resources given to the program by the operating system. For example, if the CPU is shared by other processes, std::clock time may advance slower than wall clock. On the other hand, if the current process is multithreaded and more than one execution core is available, std::clock time may advance faster than wall clock."
Why does the clock speed up with multithreading? I'm comparing the performance of a C++ program with and without threading, and I'm noticing that the reported times are similar (not better with threading), yet the threaded run feels faster (for example, clock() reports 8 seconds after only about 3 seconds of wall-clock runtime).

If more than one core is available, and you are running multiple threads, then potentially multiple threads are executing at the same time on different cores. Since clock() measures processor time, it may advance faster than wallclock time, because multiple threads are advancing it simultaneously.
This is exactly what the example in the documentation shows: two threads are created, and the clock() value reported is almost double the reported wall-clock time.
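To see this concretely, here is a minimal sketch along the lines of the cppreference example; it spins two threads for about one second of wall time each and prints both clocks. On a machine with two free cores, the CPU time should come out near twice the wall time:

#include <chrono>
#include <cstdio>
#include <ctime>
#include <thread>

// Spin for roughly one second of wall time, keeping a core busy.
void busy_wait()
{
    auto end = std::chrono::steady_clock::now() + std::chrono::seconds(1);
    volatile unsigned long sink = 0;
    while (std::chrono::steady_clock::now() < end)
        sink = sink + 1;
}

int main()
{
    std::clock_t c_start = std::clock();
    auto w_start = std::chrono::steady_clock::now();

    std::thread t1(busy_wait);
    std::thread t2(busy_wait);
    t1.join();
    t2.join();

    std::clock_t c_end = std::clock();
    auto w_end = std::chrono::steady_clock::now();

    std::printf("CPU time:  %.2f s\n", double(c_end - c_start) / CLOCKS_PER_SEC);
    std::printf("Wall time: %.2f s\n", std::chrono::duration<double>(w_end - w_start).count());
}

(Note that Microsoft's CRT implements clock() as wall-clock elapsed time rather than processor time, so this demonstration needs a conforming implementation such as glibc's.)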

Related

clock and time in c++, how clock/CLOCKS_PER_SEC is bigger than time? [duplicate]

How does the clock function work behind the scenes in a multithreaded program?
AFAIK, the clock function provided by the C standard library can be used to measure the CPU time consumed by the running program, but how does this work under the hood? Is the underlying hardware timer part of the CPU chip? If it's not, how does the clock know when the CPU is scheduled to run the current program? Since the clock function only records the time consumption of each individual CPU that executes the current program, in a multi-threaded program the returned value would be lower than the wall-clock time.
*Although I raised a related question in What's the relationship between the real CPU frequency and the clock_t in C?, those are actually two different topics, so I don't want to mix them up in one post.
Since the clock function only records the time consumption of each individual CPU that executes the current program, in a multi-threaded program the returned value would be lower than the wall-clock time.
I can't find a definitive statement in the C Standard but, according to cppreference (which is generally very reliable), your assumption is wrong and the clock() function returns the total (for all CPUs) processor time used by the program (emphasis mine):
… For example, if the CPU is shared by other processes, clock time may advance slower than wall clock. On the other hand, if the current process is multithreaded and more than one execution core is available, clock time may advance faster than wall clock.

Execution time of thread changes

I'm using this method: Measure execution time in C (on Windows) to measure the execution time of all threads. I'm on Windows and I'm using VC++.
Something like this:
// initialize start, end, frequency (QueryPerformanceCounter / QueryPerformanceFrequency)
std::thread thread1(...);
// ... and so on, initialising all the threads
thread1.join();
// ... join all the other threads
// time = (end - start) / frequency
printf("%f\n", time);
I run it multiple times and I often get different execution times, with something like 30% variation (values between 70s and 100s). I don't understand why; 30% is not a small variation.
Thanks
Many things can affect an execution time, from system load to memory usage to the stars aligning. You don't specify the order of magnitude of your execution times, but for lower values a 30% deviation is certainly not unheard of. The main takeaway is that an execution time on its own tells us basically nothing about the system.
QueryPerformanceCounter is a system-wide counter, so your result depends on system load and the particulars of the process scheduling.
You could use QueryProcessCycleTime, which returns
The number of CPU clock cycles used by the threads of the process. This value includes cycles spent in both user mode and kernel mode.
You can get a handle to the current process using GetCurrentProcess.
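A minimal sketch of that approach (do_work here is a stand-in for your actual thread function):

#include <windows.h>
#include <cstdio>
#include <thread>

void do_work()
{
    // stand-in for the real per-thread workload
}

int main()
{
    std::thread t1(do_work);
    std::thread t2(do_work);
    t1.join();
    t2.join();

    ULONG64 cycles = 0;
    if (QueryProcessCycleTime(GetCurrentProcess(), &cycles))
        std::printf("CPU cycles used by this process: %llu\n", cycles);
    else
        std::printf("QueryProcessCycleTime failed: %lu\n", GetLastError());
}

Because the result is in CPU cycles rather than seconds, it is best used to compare runs against each other rather than as an absolute time.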

Time difference between execution of two statements is not consistent

Could you please tell me why the value of timediff printed by the following program is often 4 microseconds (printed 90 to 1000 times, depending on the run), but sometimes 70 or more microseconds (2 to 10 times, depending on the run)?
#include <iostream>
#include <sys/time.h>
using namespace std;

#define MAXQ 1000000
#define THRDS 3

// Current wall-clock time in microseconds.
double GetMicroSecond()
{
    timeval tv;
    gettimeofday(&tv, NULL);
    return (double)(((double)tv.tv_sec * 1000000) + (double)tv.tv_usec);
}

int main()
{
    double timew, timer, timediff;
    bool flagarray[MAXQ];
    int x = 0, y = 0;

    for (int i = 0; i < MAXQ; ++i)
        flagarray[i] = false;

    while (y < MAXQ)
    {
        x++;
        if (x % 1000 == 0)
        {
            timew = GetMicroSecond();
            flagarray[y++]=true;
            timer = GetMicroSecond();
            timediff = timer - timew;
            // report iterations that took longer than 3 microseconds
            if (timediff > THRDS)
                cout << timer - timew << endl;
        }
    }
}
Compiled using: g++ testlatency.cpp -o testlatency
Note: In my system there are 12 cores. The performance is checked with only this program running in the system.
timew = GetMicroSecond();
flagarray[y++]=true;
timer = GetMicroSecond();
The statement flagarray[y++]=true; will take much less than a microsecond to execute on a modern computer if flagarray[y++] happens to be in the level 1 cache. The statement will take longer to execute if that location is in level 2 cache but not in level 1 cache, much longer if it is in level 3 cache but not in level 1 or level 2 cache, and much, much longer yet if it isn't in any of the caches.
Another thing that can make timer-timew exceed three microseconds is when your program yields to the OS. Cache misses can result in a yield. So can system calls. The function gettimeofday is a system call. As a general rule, you should expect any system call to yield.
Note: In my system there are 12 cores. The performance is checked with only this program running in the system.
This is not true. There are always many other programs, and many, many other threads, running on your 12-core computer. These include the operating system itself (which comprises many threads in and of itself), plus lots and lots of little daemons. Whenever your program yields, the OS can decide to temporarily suspend it so that one of the myriad other threads waiting for CPU time can run for a bit.
One of those daemons is the Network Time Protocol daemon (ntpd). This does all kinds of funky little things to your system clock to keep it closely in sync with atomic clocks. With a tiny instruction such as flagarray[y++]=true being the only thing between successive calls to gettimeofday, you might even see time occasionally go backwards.
When testing for timing, it's a good idea to do the timing at a coarse level. Don't time an individual statement that doesn't involve any function calls. It's much better to time a loop than to time individual executions of the loop body. Even then, you should expect some variability in timing because of cache misses and because the OS can temporarily suspend execution of your program.
Modern Unix-based systems have better timers than gettimeofday (e.g., clock_gettime) that are not subject to the changes made by the Network Time Protocol daemon. You should use one of these rather than gettimeofday.
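For example, here is a sketch of a drop-in replacement for the GetMicroSecond function above, using CLOCK_MONOTONIC so that ntpd adjustments cannot make it jump:

#include <time.h>

// Monotonic time in microseconds; unlike gettimeofday, CLOCK_MONOTONIC
// is never stepped backwards by ntpd.
double GetMicroSecondMonotonic()
{
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1000000.0 + (double)ts.tv_nsec / 1000.0;
}

(On older glibc versions you may need to link with -lrt.)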
Generally, there are many threads sharing a small number of cores. Unless you take steps to ensure that your thread has uninterrupted use of a core, you can't guarantee that the OS won't decide to preempt your thread between the two GetMicroSecond() calls and let some other thread use the core for a bit.
Even if your code runs uninterrupted, the line you're trying to time:
flagarray[y++]=true;
likely takes much less time to execute than the measurement code itself.
There are many things happening inside a modern OS at the same time as your program executes. Some of them may "steal" CPU from your program, as stated in NPE's answer. A few more examples of what can influence timing:
interrupts from devices (timer, HDD, network interfaces, to name a few);
access to RAM (caching).
None of these are easily predictable.
You can expect consistency if you run your code on a microcontroller, or perhaps under a real-time OS.
There are a lot of variables that might explain different time values seen. I would focus more on
Cache miss/fill
Scheduler Events
Interrupts
bool flagarray[MAXQ];
Since you defined MAXQ as 1000000, let's assume that flagarray takes up 1 MB of space (one byte per bool).
You can compute how many cache misses can occur based on your L1/L2 data-cache sizes. Then you can correlate that with how many iterations it takes to fill all of L1 and start missing, and likewise for L2. The OS may deschedule your process and reschedule it, but given the number of cores you have, I am hoping that is less likely. The same is the case with interrupts: an idle system is never completely idle. You may choose to pin your process to a core, say number N, by doing
taskset 0x<MASK> ./exe and controlling its execution.
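For example, assuming the testlatency binary built above, taskset 0x1 ./testlatency would restrict the process to core 0 only.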
If you are really curious, I would suggest that you use the "perf" tool available on most Linux distros.
You may do
perf stat -e L1-dcache-load-misses
or
perf stat -e LLC-load-misses
Once you have these numbers and the number of iterations, you can start building a picture of the activity that causes the observed lag. You may also monitor OS scheduler events using "perf stat".
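For example, a single run collecting cache-miss and scheduler activity for the program built above might look like this (exact event names vary by kernel and CPU; perf list shows what is available on your machine):

perf stat -e L1-dcache-load-misses,LLC-load-misses,context-switches ./testlatency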

What should I check: cpu time or wall time?

I have two algorithms to do the same task. To examine their performance, what should I check: cpu time or wall time? I think it is cpu time, right?
I am doing parallelism of my code. To check my parallelism performance, what should I check: cpu time or wall time? I think it is wall time, right?
Assume I have achieved ideal parallelism using multiple threads. I think the cpu time for 1 thread will be the same as for 8 threads, and the wall time for 1 thread will be 8 times longer than for 8 threads. Is that right?
Also, is there any easy way to check those times?
The answer depends on what you're really trying to measure.
If you have a couple of small code sequences where each runs on a single CPU (i.e., it's basically single-threaded) and you want to know which is faster, you probably want CPU time. This will tell you the time taken to execute that code, without counting other things like I/O, task switches, time spent on other processes, interrupt handling, etc. [Note: although it attempts to ignore other factors, you'll still usually get the most accurate results with the system otherwise as quiescent as possible.]
If you're writing multi-threaded code and want to measure how well you're distributing your code across processors/cores, you'll probably measure both CPU time and wall time, and compare the two. If, for example, you have 4 cores available, your ideal would be that the wall time is 1/4th the CPU time.
So, for multithreaded code you'll often end up doing things in two phases: first you look at the time to execute on one thread, using CPU time, and optimize to get that to a (reasonable) minimum. Then in the second phase, you compare wall time to CPU time to try to use multiple cores efficiently. Since changing one often affects the other, you may well iterate through the two a number of times (and often compromise between the two to some degree).
Just as a really general rule of thumb, you tend to use CPU time to measure microscopic benchmarks of individual bits of code, and wall time for larger (system-level) benchmarks. In other words, when you want to measure how fast one piece of code runs, and nothing else, CPU time generally makes the most sense. When you want to include the effects of things like disk I/O time, caching, etc., then you're a lot more likely to care about wall time.
Wall time tells you how long your computer took, but it doesn't tell you how long the execution of your code itself took, since that also depends on whatever else was keeping your computer busy.
There are different mechanisms for measuring the CPU time spent executing your code; I personally like getrusage().
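A minimal sketch on POSIX, printing the user and system CPU time the process has consumed so far (compare these against a wall-clock reading taken at the same points):

#include <sys/resource.h>
#include <cstdio>

// Print the user and system CPU time consumed so far by this process.
void print_cpu_time()
{
    rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0)
    {
        std::printf("user:   %ld.%06ld s\n",
                    (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
        std::printf("system: %ld.%06ld s\n",
                    (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    }
}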

No performance improvement when multithreading linear regression using boost c++ libraries

I am performing calls on a method using multiple threads via the Boost libraries, and doing so gave me quite a performance enhancement. I've recently introduced linear regression calculations into the method and am now seeing a severe per-thread performance penalty.
For instance, if I run a single thread, the average method call takes 2 seconds. If I use two threads, I register twice as much CPU activity, but the average method call takes 5-6 seconds. This continues as I increase the thread count. There are no known race conditions or (I think) significant shared memory.
It almost seems as if there is some cache or other CPU hardware resource that all the threads are using, and that it is becoming a bottleneck. But I don't know enough about CPU architecture to be sure. I am running an Intel Xeon E5-2620 CPU.
Help is desperately needed.