Speeding up gather - c++

I have a computation that produces a coefficient vector and returns the dot product of this vector with a data vector taken from a large array. To speed things up, I do this for eight vectors at a time using AVX2 SIMD intrinsics. The problem is that the bulk of the time ends up being consumed by the gather operation getting the data for the dot product.
I tried different ways of implementing the gather, and the intrinsic seems to work best. I would greatly appreciate some advice on speeding this up.
Here is a sketch:
__m256 Compute(__m256 input)
{
auto coefficients = ComputeCoefficients(input); // e.g. returning a std::array<__m256, 56>
__m256i indices = ComputeIndices(input);
__m256 sum = _mm256_setzero_ps();
for (size_t i = 0; i != 56; ++i)
{
__m256 data = _mm256_i32gather_ps(bigArray + i, indices, sizeof(float)); // <-- the gather dominates the runtime
sum = _mm256_fmadd_ps(coefficients[i], data, sum);
}
return sum;
}

I would first make sure that you are using the most recent Intel processor possible. Intel has invested a lot of engineering in improving the gather instruction.
This being said, it is not magical. If there are cache misses, you will pay a price for them.
I would try to write the same code without SIMD instructions. Is it about the same speed? If it is, then chances are good that you are limited by memory access. Vectorization is good for overcoming computational limitations and for loading and storing data in vector-size units, but even in principle it cannot be expected to help much with problems bound by random access and cache issues.
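For reference, a scalar version of a single lane might look like the sketch below; ComputeCoefficientsScalar and ComputeIndexScalar are hypothetical single-lane counterparts of your helpers. If this runs at roughly the same speed per element as the SIMD version, you are bound by memory access rather than by computation.
float ComputeScalar(float input)
{
    float coefficients[56];
    ComputeCoefficientsScalar(input, coefficients); // hypothetical single-lane helper
    int index = ComputeIndexScalar(input);          // plays the role of one lane of `indices`

    float sum = 0.0f;
    for (size_t i = 0; i != 56; ++i)
        sum += coefficients[i] * bigArray[index + i]; // same access pattern as the gather, one lane at a time
    return sum;
}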
Your code repeatedly calls VPGATHERDPS. According to Agner Fog, this instruction has a latency of 12 cycles and a throughput of one instruction every 4 cycles. The latency is, of course, a best-case figure; cache misses will increase it.
You should benchmark your code and ensure that you are close to 4 cycles per loop iteration. The main loop should complete in about 300 cycles, and that's quite fast, all things considered.
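A quick way to check that is to count cycles around the call with the TSC. A sketch: __rdtsc() counts reference cycles, which is close enough if the clock is steady, and the input below is a placeholder; feeding the same input every call means the gathered data stays cached, so vary the inputs if you want to see the cache-miss behaviour.
#include <immintrin.h>  // AVX2 intrinsics
#include <x86intrin.h>  // __rdtsc (use <intrin.h> on MSVC)

__m256 someInput = _mm256_set1_ps(1.0f);                 // placeholder input
__m256 result = _mm256_setzero_ps();
unsigned long long t0 = __rdtsc();
for (int k = 0; k < 100000; ++k)
    result = _mm256_add_ps(result, Compute(someInput));  // accumulate so the calls are not optimized away
unsigned long long t1 = __rdtsc();
double cyclesPerIteration = double(t1 - t0) / (100000.0 * 56.0); // 56 gathers per call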
You do not tell us a lot about your problem but we can guess that it is much slower than 300 cycles. If so, then you are probably having cache issues. If your table is large and you are accessing it randomly, then it is a hard problem. If you need better performance, you may need to reengineer the problem.

How to test the problem size scaling performance of code

I'm running a simple kernel which adds two streams of double-precision complex values. I've parallelized it using OpenMP with custom scheduling: the slice_indices container contains different indices for different threads.
for (const auto& index : slice_indices)
{
auto* tens1_data_stream = tens1.get_slice_data(index);
const auto* tens2_data_stream = tens2.get_slice_data(index);
#pragma omp simd safelen(8)
for (auto d_index = std::size_t{}; d_index < tens1.get_slice_size(); ++d_index)
{
tens1_data_stream[d_index].real += tens2_data_stream[d_index].real;
tens1_data_stream[d_index].imag += tens2_data_stream[d_index].imag;
}
}
The target computer has an Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz with 24 cores, a 32kB L1 cache, a 1MB L2 cache and a 33MB L3 cache. The total memory bandwidth is 115GB/s.
The following is how my code scales with problem size S = N x N x N.
Can anybody tell me with the information I've provided if:
it's scaling well, and/or
how I could go about finding out if it's utilizing all the resources which are available to it?
Thanks in advance.
EDIT:
Now I've plotted the performance in GFLOP/s with 24 cores and 48 cores (two NUMA nodes, same processor model). It looks like this:
And now the strong and weak scaling plots:
Note: I've measured the bandwidth and it turns out to be 105 GB/s.
Question: the meaning of the weird peak at 6 threads / problem size 90x90x90x16 B in the weak scaling plot is not obvious to me. Can anybody clarify this?
Your graph has roughly the right shape: tiny arrays should fit in the L1 cache and therefore get very high performance; arrays of a megabyte or so fit in L2 and get lower performance; beyond that you have to stream from memory and get low performance. So the relation between problem size and runtime should indeed get steeper with increasing size. However, the resulting graph (by the way, ops/sec is more common than mere runtime) should have a stepwise structure as you hit successive cache boundaries. I'd say you don't have enough data points to demonstrate this.
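If you do switch to reporting ops/sec, the conversion from a measured runtime is straightforward. A minimal sketch, assuming each complex element costs one real and one imaginary addition, and that each 16-byte complex double is read from both tensors and written back once (elements and seconds are placeholders for your measured values):
// elements: total number of complex values updated; seconds: measured runtime
double gflops = 2.0  * elements / seconds / 1e9;  // one real + one imaginary addition per element
double gbps   = 48.0 * elements / seconds / 1e9;  // 16 B read from each tensor + 16 B written back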
Also, typically you would repeat each "experiment" several times to 1. even out statistical hiccups and 2. make sure that data is indeed in cache.
Since you tagged this "openmp" you should also explore taking a given array size, and varying the core count. You should then get a more or less linear increase in performance, until the processor does not have enough bandwidth to sustain all the cores.
A commenter brought up the concepts of strong/weak scaling. Strong scaling means: given a certain problem size, use more and more cores. That should give you increasing performance, but with diminishing returns as overhead starts to dominate. Weak scaling means: keep the problem size per process/thread/whatever constant, and increase the number of processing elements. That should give you almost linear increasing performance, until -- as I indicated -- you run out of bandwidth. What you seem to do is actually neither of these: you're doing "optimistic scaling": increase the problem size, with everything else constant. That should give you better and better performance, except for cache effects as I indicated.
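To make the strong-scaling case concrete, a sweep could look like the sketch below; run_kernel is a hypothetical wrapper around your OpenMP loop, and the problem size stays fixed while only the thread count changes.
#include <omp.h>
#include <chrono>
#include <cstdio>
#include <initializer_list>

for (int threads : {1, 2, 4, 8, 12, 24, 48})
{
    omp_set_num_threads(threads);                 // same problem size every time
    auto t0 = std::chrono::steady_clock::now();
    run_kernel(tens1, tens2);                     // hypothetical wrapper around the loop above
    auto t1 = std::chrono::steady_clock::now();
    std::printf("%2d threads: %.3f s\n", threads,
                std::chrono::duration<double>(t1 - t0).count());
}
You should see nearly linear speedup at first, flattening once the memory bandwidth is saturated.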
So if you want to say "this code scales" you have to decide under what scenario. For what it's worth, your figure of 200 GB/s is plausible. It depends on details of your architecture, but for a fairly recent Intel node that sounds reasonable.

Using one loop vs two loops

I was reading this blog: https://developerinsider.co/why-is-one-loop-so-much-slower-than-two-loops/. And I decided to check it out using C++ and Xcode. So, I wrote the simple program given below, and when I executed it, I was surprised by the result: the second function was actually slower than the first, contrary to what is stated in the article. Can anyone please help me figure out why this is the case?
#include <iostream>
#include <vector>
#include <chrono>
using namespace std::chrono;
void function1() {
const int n=100000;
int a1[n], b1[n], c1[n], d1[n];
for(int j=0;j<n;j++){
a1[j] = 0;
b1[j] = 0;
c1[j] = 0;
d1[j] = 0;
}
auto start = high_resolution_clock::now();
for(int j=0;j<n;j++){
a1[j] += b1[j];
c1[j] += d1[j];
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
std::cout << duration.count() << " Microseconds." << std::endl;
}
void function2() {
const int n=100000;
int a1[n], b1[n], c1[n], d1[n];
for(int j=0; j<n; j++){
a1[j] = 0;
b1[j] = 0;
c1[j] = 0;
d1[j] = 0;
}
auto start = high_resolution_clock::now();
for(int j=0; j<n; j++){
a1[j] += b1[j];
}
for(int j=0;j<n;j++){
c1[j] += d1[j];
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
std::cout << duration.count() << " Microseconds." << std::endl;
}
int main(int argc, const char * argv[]) {
function1();
function2();
return 0;
}
TL;DR: The loops are basically the same, and if you are seeing differences, then your measurement is wrong. Performance measurement and more importantly, reasoning about performance requires a lot of computer knowledge, some scientific rigor, and much engineering acumen. Now for the long version...
Unfortunately, there is some very inaccurate information in the article to which you've linked, as well as in the answers and some comments here.
Let's start with the article. There won't be any disk caching that has any effect on the performance of these functions. It is true that virtual memory is paged to disk, when demand on physical memory exceeds what's available, but that's not a factor that you have to consider for programs that touch 1.6MB of memory (4 * 4 * 100K).
And if paging comes into play, the performance difference won't exactly be subtle either. If these arrays were paged to disk and back, the performance difference would be in order of 1000x for fastest disks, not 10% or 100%.
Paging and page faults and their effect on performance are neither trivial nor intuitive. You need to read about them, and experiment with them seriously. What little information that article has is completely inaccurate, to the point of being misleading.
The second issue is your profiling strategy and the micro-benchmark itself. Clearly, with such simple operations on the data (an add), the bottleneck will be memory bandwidth itself (or maybe instruction retire limits or something like that for such a simple loop). And since you only read memory linearly and use everything you read, whether it's in 4 interleaved streams or 2, you are making use of all the bandwidth that is available.
However, if you call your function1 or function2 in a loop, you will be measuring the bandwidth of different parts of the memory hierarchy depending on N, from L1 all the way to L3 and main memory. (You should know the size of all levels of cache on your machine, and how they work.) This is obvious if you know how CPU caches work, and really mystifying otherwise. Do you want to know how fast this is when you do it the first time, when the arrays are cold, or do you want to measure the hot access?
Is your real use case copying the same mid-sized array over and over again?
If not, what is it? What are you benchmarking? Are you trying to measure something or just experimenting?
Shouldn't you be measuring the fastest run through a loop, rather than the average since that can be massively affected by a (basically random) context switch or an interrupt?
Have you made sure you are using the correct compiler switches? Have you looked at the generated assembly code to make sure the compiler is not adding debug checks and what not, and is not optimizing stuff away that it shouldn't (after all, you are just executing useless loops, and an optimizing compiler wants nothing more than to avoid generating code that is not needed).
Have you looked at the theoretical memory/cache bandwidth numbers for your hardware? Your specific CPU and RAM combination will have theoretical limits. And be it 5, 50, or 500 GiB/s, it will give you an upper bound on how much data you can move around and work with. The same goes for the number of execution units, the IPC of your CPU, and a few dozen other numbers that will affect the performance of this kind of micro-benchmark.
If you are reading 4 integers (4 bytes each, from a, b, c, and d) and then doing two adds and writing the two results back, and doing it 100'000 times, then you are - roughly - looking at 2.4MB of memory read and write. If you do it 10 times in 300 micro-seconds, then your program's memory (well, store buffer/L1) throughput is about 80 GB/s. Is that low? Is that high? Do you know? (You should have a rough idea.)
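To make the "fastest run" point above concrete, a measurement harness might look like the sketch below, where run_once is a placeholder for whatever loop you actually want to time; repeating and keeping the minimum filters out context switches and interrupts.
#include <chrono>

// Returns the fastest of `repeats` runs of the callable, in microseconds.
template <typename F>
long long fastest_run_us(F run_once, int repeats = 100)
{
    long long best = -1;
    for (int r = 0; r < repeats; ++r)
    {
        auto start = std::chrono::high_resolution_clock::now();
        run_once();
        auto stop = std::chrono::high_resolution_clock::now();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
        if (best < 0 || us < best)
            best = us;
    }
    return best;
}
Call it with a lambda that wraps just the loop you care about, with the timing code removed from inside the function.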
And let me tell you that the other two answers here at the time of this writing (namely this and this) do not make sense. I can't make heads or tails of the first one, and the second one is almost completely wrong (conditional branches in a 100'000-iteration for loop are bad? allocating an additional iterator variable is costly? cold access to an array on the stack vs. on the heap has "serious performance implications"?).
And finally, as written, the two functions have very similar performance. It is really hard to separate the two, and unless you can measure a real difference in a real use case, I'd say write whichever one makes you happier.
If you really, really want a theoretical difference between them, I'd say the one with two separate loops is very slightly better, because it is usually not a good idea to interleave access to unrelated data.
This has nothing to do with caching or instruction efficiency. Simple iterations over long vectors are purely a matter of bandwidth. (Google: stream benchmark.) And modern CPUs have enough bandwidth to satisfy not all of their cores at once, but a good fraction of them.
So if you combine the two loops, executing them on a single core, there is probably enough bandwidth for all loads and stores at the rate that memory can sustain. But if you use two loops, you leave bandwidth unused, and the runtime will be a little less than double.
The reason why the second can be faster in your case (I do not think that this holds on every machine) is better CPU caching: as long as your CPU has enough cache to hold the arrays, plus whatever your OS requires and so on, the split loops benefit; once it does not, the second function will probably be much slower than the first from a performance standpoint. I also doubt that the two-loop code will give better performance if enough other programs are running, because the second function is obviously less efficient than the first, and if enough other data is cached, the performance lead gained through caching will be eliminated.
I'll just chime in here with a little something to keep in mind when looking into performance - unless you are writing embedded software for a real-time device, the performance of such low level code as this should not be a concern.
In 99.9% of all other cases, they will be fast enough.

Should I use loops for small iterations? [duplicate]

I've been trying to optimize some extremely performance-critical code (a quick sort algorithm that's being called millions and millions of times inside a Monte Carlo simulation) by loop unrolling. Here's the inner loop I'm trying to speed up:
// Search for elements to swap.
while(myArray[++index1] < pivot) {}
while(pivot < myArray[--index2]) {}
I tried unrolling to something like:
while(true) {
if(!(myArray[++index1] < pivot)) break; // negated test preserves the original loop's exit condition
if(!(myArray[++index1] < pivot)) break;
// More unrolling
}
while(true) {
if(!(pivot < myArray[--index2])) break;
if(!(pivot < myArray[--index2])) break;
// More unrolling
}
This made absolutely no difference so I changed it back to the more readable form. I've had similar experiences other times I've tried loop unrolling. Given the quality of branch predictors on modern hardware, when, if ever, is loop unrolling still a useful optimization?
Loop unrolling makes sense if you can break dependency chains. This gives an out-of-order or superscalar CPU the possibility to schedule things better and thus run faster.
A simple example:
for (int i=0; i<n; i++)
{
sum += data[i];
}
Here everything depends on the single accumulator sum, so there is only one dependency chain. If you get a stall because of a cache miss on the data array, the CPU cannot do anything but wait.
On the other hand this code:
for (int i=0; i<n-3; i+=4) // note the n-3 bound for starting i + 0..3
{
sum1 += data[i+0];
sum2 += data[i+1];
sum3 += data[i+2];
sum4 += data[i+3];
}
sum = sum1 + sum2 + sum3 + sum4;
// if n%4 != 0, handle final 0..3 elements with a rolled up loop or whatever
could run faster. If you get a cache miss or other stall in one calculation, there are still three other dependency chains that don't depend on the stall. An out-of-order CPU can execute these in parallel.
(See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an in-depth look at how register-renaming helps CPUs find that parallelism, and an in depth look at the details for FP dot-product on modern x86-64 CPUs with their throughput vs. latency characteristics for pipelined floating-point SIMD FMA ALUs. Hiding latency of FP addition or FMA is a major benefit to multiple accumulators, since latencies are longer than integer but SIMD throughput is often similar.)
Those wouldn't make any difference because you're doing the same number of comparisons. Here's a better example. Instead of:
for (int i=0; i<200; i++) {
doStuff();
}
write:
for (int i=0; i<50; i++) {
doStuff();
doStuff();
doStuff();
doStuff();
}
Even then it almost certainly won't matter but you are now doing 50 comparisons instead of 200 (imagine the comparison is more complex).
Manual loop unrolling in general is largely an artifact of history however. It's another of the growing list of things that a good compiler will do for you when it matters. For example, most people don't bother to write x <<= 1 or x += x instead of x *= 2. You just write x *= 2 and the compiler will optimize it for you to whatever is best.
Basically there's increasingly less need to second-guess your compiler.
Regardless of branch prediction on modern hardware, most compilers do loop unrolling for you anyway.
It would be worthwhile finding out how much optimization your compiler does for you.
I found Felix von Leitner's presentation very enlightening on the subject. I recommend you read it. Summary: Modern compilers are VERY clever, so hand optimizations are almost never effective.
As far as I understand it, modern compilers already unroll loops where appropriate. An example is GCC: if passed the optimisation flags, the manual says it will:
Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop.
So, in practice it's likely that your compiler will do the trivial cases for you. It's up to you therefore to make sure that as many as possible of your loops are easy for the compiler to determine how many iterations will be needed.
Loop unrolling, whether it's hand unrolling or compiler unrolling, can often be counter-productive, particularly with more recent x86 CPUs (Core 2, Core i7). Bottom line: benchmark your code with and without loop unrolling on whatever CPUs you plan to deploy this code on.
Trying without knowing is not the way to do it.
Does this sort take a high percentage of overall time?
All loop unrolling does is reduce the loop overhead of incrementing/decrementing, comparing for the stop condition, and jumping. If what you're doing in the loop takes more instruction cycles than the loop overhead itself, you're not going to see much improvement percentage-wise.
Here's an example of how to get maximum performance.
Loop unrolling can be helpful in specific cases. The only gain isn't skipping some tests!
It can, for instance, allow scalar replacement or efficient insertion of software prefetching. You would actually be surprised how useful it can be: you can easily get a 10% speedup on most loops, even with -O3, by unrolling aggressively.
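As an illustration of the software-prefetching point, here is a sketch built on the multiple-accumulator example above; the prefetch distance of 64 elements is an arbitrary assumption that has to be tuned per machine.
#include <xmmintrin.h> // _mm_prefetch

for (int i = 0; i + 3 < n; i += 4)
{
    // One prefetch per unrolled iteration instead of per element; the distance is a guess.
    _mm_prefetch(reinterpret_cast<const char*>(&data[i + 64]), _MM_HINT_T0);
    sum1 += data[i + 0];
    sum2 += data[i + 1];
    sum3 += data[i + 2];
    sum4 += data[i + 3];
}
For a purely linear walk like this the hardware prefetcher usually does the job on its own; the pattern pays off mainly for strided or indirect access.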
As was said before though, it depends a lot on the loop and the compiler, and experimentation is necessary. It's hard to make a rule (or the compiler heuristic for unrolling would be perfect).
Whether loop unrolling helps depends entirely on your problem size, and on your algorithm being able to break the work into smaller groups. What you did above does not look like that, and I am not sure whether a Monte Carlo simulation can even be unrolled.
A good scenario for loop unrolling would be rotating an image, since you could process separate groups of pixels independently. To get that to work you would have to reduce the number of iterations.
Loop unrolling is still useful if there are a lot of local variables in and around the loop, because it lets you reuse those registers rather than dedicating one to the loop index.
In your example, you use a small number of local variables, so you are not overusing the registers.
The comparison against the loop end is also a major drawback if the comparison is heavy (i.e. not a simple test instruction), especially if it depends on an external function.
Loop unrolling also helps the CPU's branch prediction, but branches occur anyway.

How to demonstrate the impact of instruction cache limitations

My original idea was to give an elegant code example that would demonstrate the impact of instruction cache limitations. I wrote the following piece of code, which creates a large number of identical functions using template metaprogramming.
volatile int checksum;
void (*funcs[MAX_FUNCS])(void);
template <unsigned t>
__attribute__ ((noinline)) static void work(void) { ++checksum; }
template <unsigned t>
static void create(void) { funcs[t - 1] = &work<t - 1>; create<t - 1>(); }
template <> void create<0>(void) { }
int main()
{
create<MAX_FUNCS>();
for (unsigned range = 1; range <= MAX_FUNCS; range *= 2)
{
checksum = 0;
for (unsigned i = 0; i < WORKLOAD; ++i)
{
funcs[i % range]();
}
}
return 0;
}
The outer loop varies the amount of different functions to be called using a jump table. For each loop pass, the time taken to invoke WORKLOAD functions is then measured. Now what are the results? The following chart shows the average run time per function call in relation to the used range. The blue line shows the data measured on a Core i7 machine. The comparative measurement, depicted by the red line, was carried out on a Pentium 4 machine. Yet when it comes to interpreting these lines, I seem to be somehow struggling...
The only jumps of the piecewise constant red curve occur exactly where the total memory consumption of all functions within range exceeds the capacity of one cache level on the tested machine, which has no dedicated instruction cache. For very small ranges (below 4 in this case), however, run time still increases with the number of functions. This may be related to branch prediction efficiency, but since every function call reduces to an unconditional jump in this case, I'm not sure whether there should be any branching penalty at all.
The blue curve behaves quite differently. Run time is constant for small ranges and increases logarithmically thereafter. Yet for larger ranges, the curve seems to be approaching a constant asymptote again. How exactly can the qualitative differences between the two curves be explained?
I am currently using GCC MinGW Win32 x86 v.4.8.1 with g++ -std=c++11 -ftemplate-depth=65536 and no compiler optimization.
Any help would be appreciated. I am also interested in any idea on how to improve the experiment itself. Thanks in advance!
First, let me say that I really like how you've approached this problem; this is a really neat solution for intentional code bloat. However, there might still be several possible issues with your test -
You also measure the warmup time. You didn't show where you've placed your time checks, but if they're just around the internal loop, then until you reach range/2 you'd still enjoy the warmup from the previous outer iteration. Instead, measure only warm performance: run each internal iteration several times (add another loop in the middle), and take the timestamp only after 1-2 rounds.
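In code, that restructuring might look like the sketch below, where now_ns() stands for whatever high-resolution timer you are already using and the repetition count is arbitrary; the first rounds warm the caches and are discarded.
for (unsigned range = 1; range <= MAX_FUNCS; range *= 2)
{
    long long best = -1;
    for (int rep = 0; rep < 5; ++rep)                 // extra middle loop
    {
        long long t0 = now_ns();                      // hypothetical timer
        for (unsigned i = 0; i < WORKLOAD; ++i)
            funcs[i % range]();
        long long t1 = now_ns();
        if (rep >= 2 && (best < 0 || t1 - t0 < best)) // rounds 0 and 1 are warmup only
            best = t1 - t0;
    }
    // record `best` for this range
}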
You claim to have measured several cache levels, but your L1 cache is only 32k, which is where your graph ends. Even assuming this counts in terms of "range", each function is ~21 bytes (at least on my gcc 4.8.1), so you'll reach at most 256KB, which is only then scratching the size of your L2.
You didn't specify your CPU model (the i7 has had at least 4 generations on the market by now: Haswell, Ivy Bridge, Sandy Bridge and Nehalem). The differences are quite large, for example the additional uop cache since Sandy Bridge, with its complicated storage rules and conditions. Your baseline is also complicating things: if I recall correctly, the P4 had a trace cache, which might also cause all sorts of performance impacts. You should check for an option to disable them if possible.
Don't forget the TLB - even though it probably doesn't play a role here in such a tightly organized code, the number of unique 4k pages should not exceed the ITLB (128 entries), and even before that you may start having collisions if your OS did not spread the physical code pages well enough to avoid ITLB collisions.

Why is memory access so slow?

I've got a very strange performance issue related to memory access.
The code snippet is:
#include <vector>
using namespace std;
vector<int> arrx(M,-1);
vector< vector<int> > arr(N,arrx);
...
for(i=0;i<N;i++){
for(j=0;j<M;j++){
//>>>>>>>>>>>> Part 1 <<<<<<<<<<<<<<
// Simple arithmetic operations
int n1 = 1 + 2; // does not matter what (actually more complicated)
// Integer assignment, without access to array
int n2 = n1;
//>>>>>>>>>>>> Part 2 <<<<<<<<<<<<<<
// This turns out to be most expensive part
arr[i][j] = n1;
}
}
N and M - are some constants of the order of 1000 - 10000 or so.
When I compile this code (release version), it takes approximately 15 clocks to finish if Part 2 is commented out. With this part, the execution time goes up to 100+ clocks, so it is almost 10 times slower. I expected the assignment operation to be much cheaper than even simple arithmetic operations. This is actually true if we do not use arrays, but with the array the assignment seems to be much more expensive.
I also tried a 1-D array instead of the 2-D one - the same result (the 2-D version is obviously slower).
I also used int** or int* instead of vector< vector< int > > or vector< int > - again the result is the same.
Why do I get such poor performance in array assignment and can I fix it?
Edit:
One more observation: in the part 2 of the given code if we change the assignment from
arr[i][j] = n1; // 172 clocks
to
n1 = arr[i][j]; // 16 clocks
the speed (numbers in comments) goes up. More interestingly, if we change the line:
arr[i][j] = n1; // 172 clocks
to
arr[i][j] = arr[i][j] * arr[i][j]; // 110 clocks
the speed is also higher than for the simple assignment.
Is there any difference in reading and writing from/to memory? Why do I get such strange performance?
Thanks in advance!
Your assumptions are really wrong... Writing to main memory is much slower than doing a simple addition. If you don't do the write, it is likely that the loop will be optimized away entirely.
Unless your actual "part 1" is significantly more complex than your example would lead us to believe, there's no surprise here -- memory access is slow compared to basic arithmetic. Additionally, if you're compiling with optimizations, most or all of "part 1" may be getting optimized away because the results are never used.
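One way to check how much of "part 1" survives optimization (a sketch; the literal 1 + 2 shown in the question is of course constant-folded anyway, so substitute the more complicated computation the comment alludes to): funnel the results into something observable and compare the timing against the version that writes to the array.
volatile int sink; // an observable destination the optimizer cannot remove

void part1_only(int N, int M)
{
    int acc = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            int n1 = 1 + 2;  // stand-in for the real "part 1" computation
            int n2 = n1;
            acc += n2;       // keep the result alive without touching the big array
        }
    }
    sink = acc;              // a single write makes the whole loop observable
}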
You have discovered a sad truth about modern computers: memory access is really slow compared to arithmetic. There's nothing you can do about it. It is because electric fields in a copper wire only travel at about two-thirds of the speed of light.
Your nested arrays are going to be somewhere in the region of 50–500MB of memory. It takes time to write that much memory, and no amount of clever hardware memory caching is going to help that much. Moreover, even a single memory write will take time as it has to make its way over some copper wires on a circuit board to a lump of silicon some distance away; physics wins.
But if you want to dig in more, try the cachegrind tool (assuming it's present on your platform). Just be aware that the code you're using above doesn't really permit a lot of optimization; its data access pattern doesn't have enormous amounts of reuse potential.
Let's make a simple estimate. A typical CPU clock speed nowadays is about 1-2 GHz (giga = 10^9). Simplifying a (really) great deal, this means that a single processor operation takes about 1 ns (nano = 10^-9 s). A simple arithmetic operation such as an int addition takes several CPU cycles, of order ten.
Now, memory: a typical memory access time is about 50 ns (again, it's not necessary just now to go into the gory details, which are aplenty).
You see that even in the very best case scenario, the memory is slower than CPU by a factor of about 5 to 10.
In fact, I'm sweeping an enormous amount of detail under the rug, but I hope the basic idea is clear. If you're interested, there are books around (keywords: cache, cache misses, data locality, etc.). This one is dated, but still very good at explaining the general concepts.