How does OpenMP actually reduce clock cycles? - concurrency

It might be a silly question, but with OpenMP you can distribute the operations across all the cores your CPU has. Of course, it is going to be faster 99% of the time, because you go from a single core doing N operations to K cores doing the same total amount of work at the same time.
Despite this, the total number of clock cycles should be the same, right? Because the number of operations is the same. Or am I wrong?

This question boils down more or less to the difference between CPU time and elapsed time. Indeed, we see here more often than not questions which start with "my code doesn't scale, why?", for which the first answer is "How did you measure the time?" (a quick search will turn up many examples).
But to further illustrate how things work, let's imagine you have a fixed-size problem, for which you have an algorithm that is perfectly parallelized. You have 120 actions to do, each taking 1 second. Then, 1 CPU core would take 120s, 2 cores would take 60s, 3 cores 40s, etc.
That is the elapsed time that is decreasing. However, 2 cores, running for 60 seconds in parallel, will consume 120s of CPU time. This means that the overall number of clock cycles won't have reduced compared to having only one CPU core running.
In summary, for a perfectly parallelized problem, you expect to see your elapsed time scaling down perfectly with the number of cores used, and the CPU time to remain constant.
In reality, what you often see is the elapsed time scaling down less than expected, due to parallelization overheads and/or imperfect parallelization. At the same time, you see the CPU time slightly increasing with the number of cores used, for the same reasons.
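To make that concrete, here is a minimal sketch (the loop and problem size are arbitrary) that measures both quantities for an OpenMP loop: wall time with omp_get_wtime() and, on POSIX systems, process CPU time with std::clock(), which accumulates across threads there. Compile with something like g++ -O2 -fopenmp.

#include <omp.h>
#include <cstdio>
#include <ctime>

int main() {
    const int n = 100000000;          // arbitrary problem size
    double sum = 0.0;

    double wall0 = omp_get_wtime();   // elapsed (wall) time
    std::clock_t cpu0 = std::clock(); // CPU time (summed over threads on POSIX)

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += 1.0 / (i + 1);

    double wall = omp_get_wtime() - wall0;
    double cpu  = double(std::clock() - cpu0) / CLOCKS_PER_SEC;

    // With more threads, wall should drop while cpu stays roughly constant
    // (or grows a little because of parallelization overhead).
    std::printf("sum=%f  wall=%.3fs  cpu=%.3fs\n", sum, wall, cpu);
    return 0;
}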

I think the answer depends on how you define the total number of clock cycles. If you define it as the sum of all the clock cycles from the different cores, then you are correct and there will not be fewer clock cycles. But if you define it as the number of clock cycles on the "main" core between initiating and completing the distributed operations, then it is hopefully fewer.

Related

Understanding FMA performance

I would like to understand how to compute FMA performance. If we look into the description here:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_fmadd_ps&expand=2520,2520&techs=FMA
for the Skylake architecture the instruction has Latency=4 and Throughput (CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction.
So, as far as I understand, if the max (turbo) clock frequency is 3 GHz, then on a single core I can execute 1,500,000,000 instructions in one second.
Is that right? If so, what could be the reason that I am observing a slightly higher performance?
A throughput of 0.5 means that the processor can execute two independent FMAs per cycle. So at 3 GHz, the maximum FMA throughput is 6 billion per second. You said you are only able to achieve a throughput that is slightly larger than 1.5 billion. This can happen due to one or more of the following reasons:
The frontend is delivering less than 2 FMA uops every single cycle due to a frontend bottleneck (the DSB path or the MITE path).
There are data dependencies between the FMAs or with other instructions (that are perhaps part of the looping mechanics). This can be stated alternatively as follows: there are fewer than 2 FMAs ready in the RS every single cycle. Latency comes into play when there are dependencies.
Some of the FMAs use memory operands; if these are not found in the L1D cache when they are needed, a throughput of 2 FMAs per cycle cannot be sustained.
The core frequency becomes less than 3GHz during the experiment. This factor only impacts the throughput per second, not per cycle.
Other reasons depending on how exactly your loop works and how you are measuring throughput.
Latency=4 and Throughput(CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction.
Just working out the units gives cycles²/instr, which is strange and I have no interpretation for it.
The throughput listed here is really a reciprocal throughput, in CPI, so 0.5 cycles per instruction, or 2 instructions per cycle. Those two numbers are each other's reciprocal; the latency has nothing to do with it.
There is a related calculation that does involve both latency and (reciprocal) throughput, namely the product of the latency and the throughput: 4 * 2 = 8 (in units of "number of instructions"). This is how many independent instances of the operation can be "in flight" (started but not completed) simultaneously, comparable with the bandwidth-delay product in network theory. This number informs some code design decisions, because it is a lower bound on the amount of instruction-level parallelism the code needs to expose to the CPU in order for it to fully use the computation resources.
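As a rough illustration of that in-flight figure (this is a sketch, not the asker's benchmark; the loop count is arbitrary), the snippet below keeps 4 / 0.5 = 8 independent FMA dependency chains running with AVX/FMA intrinsics, which is about what it takes to approach 2 FMAs per cycle; with only one or two accumulators, latency becomes the bound instead. Compile with something like g++ -O2 -mfma.

#include <immintrin.h>
#include <chrono>
#include <cstdio>

int main() {
    const long iters = 100000000;               // arbitrary
    __m256 x = _mm256_set1_ps(1.0000001f);
    __m256 y = _mm256_set1_ps(0.9999999f);
    // Eight independent accumulator chains: each FMA depends only on the
    // previous FMA of its own chain, so both FMA ports can stay busy.
    __m256 a0 = x, a1 = x, a2 = x, a3 = x, a4 = x, a5 = x, a6 = x, a7 = x;

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i) {
        a0 = _mm256_fmadd_ps(a0, y, x);
        a1 = _mm256_fmadd_ps(a1, y, x);
        a2 = _mm256_fmadd_ps(a2, y, x);
        a3 = _mm256_fmadd_ps(a3, y, x);
        a4 = _mm256_fmadd_ps(a4, y, x);
        a5 = _mm256_fmadd_ps(a5, y, x);
        a6 = _mm256_fmadd_ps(a6, y, x);
        a7 = _mm256_fmadd_ps(a7, y, x);
    }
    double secs = std::chrono::duration<double>(
                      std::chrono::steady_clock::now() - t0).count();

    // Combine and store the accumulators so the compiler keeps the loop.
    __m256 s = _mm256_add_ps(_mm256_add_ps(_mm256_add_ps(a0, a1), _mm256_add_ps(a2, a3)),
                             _mm256_add_ps(_mm256_add_ps(a4, a5), _mm256_add_ps(a6, a7)));
    float sink[8];
    _mm256_storeu_ps(sink, s);

    std::printf("%.2f billion FMA instructions/s (sink=%f)\n",
                8.0 * iters / secs / 1e9, sink[0]);
    return 0;
}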

Determining Optimum Thread Count

So, as part of a school assignment, we are being asked to determine what our optimum thread count is for our personal computers by constructing a toy program.
To start, we are to create a task that takes between 20 and 30 seconds to run. I chose to do a coin toss simulation, where the total number of heads and tails are accumulated and then displayed. On my machine, 300,000,000 tosses on one thread ended up at 25 seconds. After that, I went to 2 threads, then 4, then 8, 16, 32, and, just for fun, 100.
Here are the results:
* Threads   Tosses per thread   Time (seconds)
* ---------------------------------------------
*     1       300,000,000         25
*     2       150,000,000         13
*     4        75,000,000         13
*     8        37,500,000         13
*    16        18,750,000         14
*    32         9,375,000         14
*   100         3,000,000         14
And here is the code I'm using:
#include <iostream>
#include <random>
#include <thread>
#include <vector>
#include <ctime>
using namespace std;

void toss()
{
    int heads = 0, tails = 0;
    default_random_engine gen;
    uniform_int_distribution<int> dist(0, 1);
    int max = 3000000;                                   // tosses per thread
    for (int x = 0; x < max; ++x) { (dist(gen)) ? ++heads : ++tails; }
    cout << heads << " " << tails << endl;
}

int main()
{
    vector<thread> thr;
    time_t st, fin;
    st = time(0);
    for (int i = 0; i < 100; ++i) { thr.push_back(thread(toss)); }  // thread count
    for (auto& t : thr) { t.join(); }
    fin = time(0);
    cout << fin - st << " seconds\n";
    return 0;
}
Now for the main question:
Past a certain point, I would've expected there to be a considerable decline in computing speed as more threads were added, but the results don't seem to show that.
Is there something fundamentally wrong with my code that would yield these sorts of results, or is this behavior considered normal? I'm very new to multi-threading, so I have a feeling it's the former....
Thanks!
EDIT: I am running this on a MacBook with a 2.16 GHz Core 2 Duo (T7400) processor.
Your results seem very normal to me. While thread creation has a cost, it's not that much (especially compared to the per-second granularity of your tests). An extra 100 thread creations, destructions, and possible context switches isn't going to change your timing by more than a few milliseconds, I bet.
Running on my Intel i7-4790 @ 3.60 GHz I get these numbers:
threads - seconds
-----------------
1 - 6.021
2 - 3.205
4 - 1.825
8 - 1.062
16 - 1.128
32 - 1.138
100 - 1.213
1000 - 2.312
10000 - 23.319
It takes many, many more threads to get to the point at which the extra threads make a noticeable difference. Only when I get to 1,000 threads do I see that the thread-management has made a significant difference and at 10,000 it dwarfs the loop (the loop is only doing 30,000 tosses at that point).
As for your assignment, it should be fairly straightforward to see that the optimal number of threads for your system is the same as the number of hardware threads that can execute at once. There isn't any processing power left to run another thread until one is either done or yields, which doesn't help you finish faster. And with any fewer threads, you aren't using all the available resources. My CPU has 8 hardware threads and the chart reflects that.
Edit 2 - To further elaborate on the "lack of performance penalty" part due to popular demand:
...I would've expected there to be a considerable decline in computing speed as more threads were added, but the results don't seem to show that.
I made this giant chart in order to better illustrate the scaling.
To explain the results:
The blue bar illustrates the total time to do all the tosses. Although that time decreases all the way up to 256 threads, the gains from doubling the thread count get smaller and smaller. The CPU I ran this test on has 4 physical and 8 logical cores. Scaling is pretty good all the way to 4 threads and decent up to 8, and then it plummets. Pipeline saturation allows for minor gains all the way to 256, but it is simply not worth it.
The red bar illustrates the time per toss. It is nearly identical for 1 and 2 threads, as the CPU pipeline hasn't reached full saturation yet. It takes a minor hit at 4 threads: it still runs fine, but now the pipeline is saturated. At 8 threads it really shows that logical cores are not the same thing as physical ones, and that gets progressively worse pushing above 8 threads.
The green bar illustrates the overhead, or how much lower the actual performance is relative to the expected doubling from doubling the thread count. Pushing above the available logical cores causes the overhead to skyrocket. Note that this is mostly thread synchronization; the actual thread-scheduling overhead is probably constant after a given point. There is a minimal window of activity time a thread must receive, which explains why thread switching never gets to the point of overwhelming the work throughput. In fact, there is no severe performance drop all the way to 4k threads, which is expected, as modern systems have to be able to run (and often do run) over a thousand threads in parallel. And again, most of that drop is due to thread synchronization, not thread switching.
The black outline bar illustrates the time difference relative to the lowest time. At 8 threads we only lose ~14% of absolute performance from not oversaturating the pipeline, which is a good thing, because in most cases it is not really worth stressing the entire system over so little. It also shows that 1 thread is only ~6 times slower than the maximum the CPU can pull off, which gives a figure of how good logical cores are compared to physical cores: 100% extra logical cores give a 50% boost in performance. In this use case a logical core is ~50% as good as a physical one, which also correlates with the ~47% boost we see going from 4 to 8 threads. Note that this is a very simple workload, though; in more demanding cases the boost is closer to 20-25% for this particular CPU, and in some edge cases there is actually a performance hit.
Edit 1 - I foolishly forgot to isolate the computational workload from the thread synchronization workload.
Running the test with little to no work reveals that for high thread counts the thread-management part takes the bulk of the time. Thus the thread-switching penalty is indeed very small, and possibly constant after a certain point.
And it would make a lot of sense if you put yourself in the shoes of a thread-scheduler author. The scheduler can easily be protected from being choked by an unreasonably high switching-to-working ratio, so there is likely a minimal window of time the scheduler will give to a thread before switching to another, while the rest are put on hold. This ensures that the switching-to-working ratio will never exceed the limits of what is reasonable. It would be much better to stall other threads than go crazy with thread switching, as the CPU would mostly be switching and doing very little actual work.
The optimal thread count is the number of available logical CPU cores. This achieves optimal pipeline saturation.
If you use more, you will suffer performance degradation due to the cost of thread context switching. The more threads, the bigger the penalty.
If you use fewer, you will not be utilizing the full hardware potential.
There is also the problem of workload granularity, which is very important when you use synchronization such as a mutex. If your concurrency is too finely grained, you can experience performance drops even when going from 1 to 2 threads on an 8-thread machine. You'd want to reduce synchronization as much as possible, doing as much work as possible between synchronization points; otherwise you can experience huge performance drops.
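To illustrate the granularity point with a small sketch (not part of the original answer): incrementing a shared counter under a mutex on every iteration versus accumulating locally and merging once per thread.

#include <mutex>
#include <thread>
#include <vector>

std::mutex m;
long shared_total = 0;

// Finely grained: one lock per increment -- threads mostly wait on the mutex.
void count_fine(int n) {
    for (int i = 0; i < n; ++i) {
        std::lock_guard<std::mutex> lock(m);
        ++shared_total;
    }
}

// Coarsely grained: do all the work locally, synchronize once at the end.
void count_coarse(int n) {
    long local = 0;
    for (int i = 0; i < n; ++i) ++local;
    std::lock_guard<std::mutex> lock(m);
    shared_total += local;
}

int main() {
    std::vector<std::thread> thr;
    for (int i = 0; i < 8; ++i) thr.emplace_back(count_coarse, 1000000);  // compare with count_fine
    for (auto& t : thr) t.join();
    return 0;
}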
Note the difference between physical and logical CPU cores. Processors with hyper-threading can have more than one logical core per physical core. "Secondary" logical cores do not have the same computational power as the "primary" ones, as they merely exploit vacancies in the processor pipeline.
For example, if you have a 4-core, 8-thread CPU, then with a perfectly scaling workload you will see a 4x increase in performance going from 1 to 4 threads, but a lot less going from 4 to 8 threads, as is evident from vu1p3n0x's answer.
You can look here for ways to determine the number of available CPU cores.
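One portable way to query the number of logical cores from standard C++ is std::thread::hardware_concurrency(); note that it may return 0 if the value cannot be determined.

#include <iostream>
#include <thread>

int main() {
    unsigned n = std::thread::hardware_concurrency(); // logical cores; may return 0
    if (n == 0) n = 1;                                // fall back if it cannot be determined
    std::cout << "logical cores: " << n << "\n";
    return 0;
}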

What should I check: cpu time or wall time?

I have two algorithms that do the same task. To examine their performance, what should I check: CPU time or wall time? I think it is CPU time, right?
I am parallelizing my code. To check my parallel performance, what should I check: CPU time or wall time? I think it is wall time, right?
Assume I have done an ideal parallelization using multiple threads. I think the CPU time for 1 thread will be the same as for 8 threads, and the wall time for 1 thread will be 8 times longer than for 8 threads. Is that right?
Also, is there an easy way to check those times?
The answer depends on what you're really trying to measure.
If you have a couple of small code sequences where each runs on a single CPU (i.e., it's basically single-threaded) and you want to know which is faster, you probably want CPU time. This will tell you the time taken to execute that code, without counting other things like I/O, task switches, time spent on other processes, interrupt handling, etc. [Note: although it attempts to ignore other factors, you'll still usually get the most accurate results with the system otherwise as quiescent as possible.]
If you're writing multi-threaded code and want to measure how well you're distributing your code across processors/cores, you'll probably measure both CPU time and wall time, and compare the two. If, for example, you have 4 cores available, your ideal would be that the wall time is 1/4th the CPU time.
So, for multithreaded code you'll often end up doing things in two phases: first you look at the time to execute one thread, using CPU time, and optimize to get that to a (reasonable) minimum. Then in the second phase, you compare wall time to CPU time, to try to use multiple cores efficiently. Since changing one often affects the other, you may well iterate through the two a number of times (and often compromise between the two to some degree).
Just as a really general rule of thumb, you tend to use CPU time to measure microscopic benchmarks of individual bits of code, and wall time for larger (system-level) benchmarks. In other words, when you want to measure how fast one piece of code runs, and nothing else, CPU time generally makes the most sense. When you want to include the effects of things like disk I/O time, caching, etc., then you're a lot more likely to care about wall time.
Wall time tells you how long your computer took, but it doesn't tell you much about how long executing your code itself took, because it also depends on whatever else was keeping your computer busy.
There are different mechanisms for measuring CPU time spent executing your code - I personally like getrusage()
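For example, on a POSIX system you could combine getrusage() for CPU time with std::chrono for wall time; a minimal sketch (the workload is just a placeholder):

#include <sys/resource.h>
#include <chrono>
#include <cstdio>

static double cpu_seconds() {
    rusage ru;
    getrusage(RUSAGE_SELF, &ru);                  // user + system CPU time of this process
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
         + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

int main() {
    auto   w0 = std::chrono::steady_clock::now(); // wall clock
    double c0 = cpu_seconds();

    volatile double x = 0.0;                      // placeholder CPU-bound work
    for (long i = 0; i < 200000000; ++i) x += 1.0 / (i + 1);

    double cpu  = cpu_seconds() - c0;
    double wall = std::chrono::duration<double>(
                      std::chrono::steady_clock::now() - w0).count();
    std::printf("cpu=%.3fs  wall=%.3fs\n", cpu, wall);
    return 0;
}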

How do I limit CPU usage for a specific process?

How can I limit CPU usage to 10%, for example, for a specific process in Windows, using C++?
You could use Sleep(x), where x is the time in milliseconds - it will slow down your program's execution but free up CPU cycles.
This is rarely needed, and thread priorities may be a better solution, but since you asked, here is what you should do:
do a small fraction of your "solid" work i.e. calculations
measure how much time step 1) took, let's say it's twork milliseconds
Sleep() for (100/percent - 1)*twork milliseconds where percent is your desired load
go back to 1.
In order for this to work well, you have to be really careful in selecting how big a "fraction" of a calculation is and some tasks are hard to split up. A single fraction should take somewhere between 40 and 250 milliseconds or so, if it takes less, the overhead from sleeping and measuring might become significant, if it's more, the illusion of using 10% CPU will disappear and it will seem like your thread is oscillating between 0 and 100% CPU (which is what happens anyway, but if you do it fast enough then it looks like you only take whatever percent). Two additional things to note: first, as mentioned before me, this is on a thread level and not process level; second, your work must be real CPU work, disk/device/network I/O usually involves a lot of waiting and doesn't take as much CPU.
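A rough sketch of that duty-cycle loop on Windows (the work chunk and the 10% target are illustrative):

#include <windows.h>

// Hypothetical work unit: roughly 50-100 ms of pure CPU work.
static void do_work_chunk() {
    volatile double x = 0.0;
    for (long i = 0; i < 20000000; ++i) x += 1.0 / (i + 1);
}

int main() {
    const int percent = 10;                      // desired CPU load for this thread
    for (;;) {
        DWORD start = GetTickCount();
        do_work_chunk();
        DWORD twork = GetTickCount() - start;    // measured work time in ms
        if (twork > 0)
            Sleep((100 / percent - 1) * twork);  // percent=10 -> sleep 9x the work time
    }
}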
That is the job of the OS; you cannot control it.
You can't limit it to exactly 10%, but you can reduce its priority and restrict it to using only one CPU core.
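As a sketch of that suggestion using the Win32 API (applied to the current process for simplicity):

#include <windows.h>

int main() {
    HANDLE proc = GetCurrentProcess();
    // Let other work preempt this process easily.
    SetPriorityClass(proc, BELOW_NORMAL_PRIORITY_CLASS);
    // Restrict the process to CPU 0 only (bit 0 of the affinity mask).
    SetProcessAffinityMask(proc, 1);
    // ... run the actual workload here ...
    return 0;
}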

How do you measure the effect of branch misprediction?

I'm currently profiling an implementation of binary search. Using some special instructions to measure this, I noticed that the code has about a 20% misprediction rate. I'm curious whether there is any way to check how many cycles I'm potentially losing due to this. It's a MIPS-based architecture.
You're losing 0.2 * N cycles per iteration, where N is the number of cycles it takes to flush the pipelines after a mispredicted branch. Suppose N = 10; then you are losing 2 clocks per iteration on aggregate. Unless you have a very small inner loop, this is probably not going to be a significant performance hit.
Look it up in the docs for your CPU. If you can't find this information specifically, the length of the CPU's pipeline is a fairly good estimate.
Given that it's MIPS and it's a 300MHz system, I'm going to guess that it's a fairly short pipeline. Probably 4-5 stages, so a cost of 3-4 cycles per mispredict is probably a reasonable guess.
On an in-order CPU you may be able to calculate the approximate misprediction cost as the product of the number of mispredicts and the per-mispredict cost (which is generally a function of the length of some part of the pipeline).
On a modern out-of-order CPU, however, such a general calculation is usually not possible. There may be a large number of instructions in flight1, only some of which are flushed by a misprediction. The surrounding code may be latency bound by one or more chains of dependent instructions, or it may be throughput bound on resources like execution units, renaming throughput, etc, or it may be somewhere in-between.
On such a core, the penalty per misprediction is very difficult to determine, even with the help of performance counters. You can find entire papers dedicated to the topic: one such paper found a penalty ranging from 9 to 35 cycles averaged across entire benchmarks; if you look at some small piece of code, the range will be even larger: a penalty of zero is easy to demonstrate, and you could create a scenario where the penalty is in the hundreds of cycles.
Where does that leave you, just trying to determine the misprediction cost in your binary search? Well, a simple approach is just to control the number of mispredictions and measure the difference! If you set up your benchmark input to have a range of behaviors, starting with always following the same branch pattern and going all the way to a random pattern, you can plot misprediction count versus runtime degradation. If you do, share your results!
1Hundreds of instructions in-flight in the case of modern big cores such as those offered by the x86, ARM and POWER architectures.
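As a sketch of that controlled-misprediction experiment (in C++ rather than anything MIPS-specific, and with illustrative sizes): search for the same key repeatedly so the branch pattern is predictable, then for random keys, and compare the time per lookup. Part of the difference can also come from cache effects, so keep the array small enough to stay cache-resident.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

static double ns_per_lookup(const std::vector<int>& data,
                            const std::vector<int>& keys) {
    auto t0 = std::chrono::steady_clock::now();
    long hits = 0;
    for (int k : keys)
        hits += std::binary_search(data.begin(), data.end(), k);
    double ns = std::chrono::duration<double, std::nano>(
                    std::chrono::steady_clock::now() - t0).count();
    std::printf("(hits=%ld) ", hits);             // keep the work observable
    return ns / keys.size();
}

int main() {
    const int n = 1 << 20;                        // small enough to stay cache-resident
    std::vector<int> data(n);
    for (int i = 0; i < n; ++i) data[i] = i;

    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, n - 1);

    std::vector<int> fixed(1000000, n / 3);       // same key: branches well predicted
    std::vector<int> random_keys(1000000);
    for (int& k : random_keys) k = dist(rng);     // random keys: roughly random branch outcomes

    std::printf("fixed key:  %.1f ns/lookup\n", ns_per_lookup(data, fixed));
    std::printf("random key: %.1f ns/lookup\n", ns_per_lookup(data, random_keys));
    return 0;
}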
Look at your specs for that info, and if that fails, run it a billion times and time it externally to your program (with a stopwatch or something). Then run it without a miss and compare.