Intel Fortran vectorisation: vector loop cost higher than scalar - fortran

I'm testing and optimising a legacy code with Intel Fortran 15, and I have this simple loop:
do ir = 1, N(lev)
  G1(lev)%D(ir) = 0.d0
  G2(lev)%D(ir) = 0.d0
enddo
where lev is equal to some integer.
The structures and indexing are quite complex for the compiler, but it does manage to handle them, as I can see on other lines of the report.
Now, on the above loop, I get this from the compilation report:
LOOP BEGIN at MLFMATranslationProd.f90(38,2)
remark #15399: vectorization support: unroll factor set to 4
remark #15300: LOOP WAS VECTORIZED
remark #15462: unmasked indexed (or gather) loads: 2
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 12
remark #15477: vector loop cost: 20.000
remark #15478: estimated potential speedup: 2.340
remark #15479: lightweight vector operations: 5
remark #15481: heavy-overhead vector operations: 1
remark #15488: --- end vector loop cost summary ---
LOOP END
My question is: how is it that the vector loop cost is higher than the scalar one? What can I do to go towards the estimated potential speedup?

The loop cost is an estimate of the duration of one loop iteration. A vectorised iteration takes somewhat longer than a scalar one, but it processes several array elements at once.
In your case the estimated speedup is roughly 12 / 20 * 4 = 2.4, because one vectorised iteration processes 4 double-precision elements (probably with AVX instructions).

Related

OpenMP: How to improve the efficiency by parallelism?

I would like to parallelize the following C++ loop using OpenMP to improve its efficiency. (The value of each element in array2d can be 0, 1 or 2. The values of array2d are not important for efficiency, so I just randomly set each value to 0, 1 or 2. The values in count are initialized to 0.)
int array2d[100][10000];
int count[3][3][3];
//omp_set_num_threads(2);
//#pragma omp parallel for
for (int i = 0; i < 10000; ++i) {
    int x = array2d[10][i];
    int y = array2d[40][i];
    int z = array2d[78][i];
    //#pragma omp atomic
    count[z][x][y]++;
}
But I cannot get any improvement if I use 2, 4, or 8 threads to parallelize the loop with #pragma omp parallel for. The execution time of the parallel versions is greater than that of the sequential version. I am curious whether this loop can be improved by OpenMP parallelism. If so, how can I get a shorter execution time?
If your concern is efficiency, there are other things to do before you try OpenMP.
Your code is not cache-friendly: the row of 100 ints is 400 bytes, while the cache line is only 64 bytes. Since the values are limited to 0..2, a single byte (uint8_t) will work better. I would even pack four of those into each byte, as sketched below.
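A minimal sketch of that packing idea; the get2/set2 helper names are my own invention, not from the question. Since each value fits in 2 bits, four values share one byte:
#include <cstdint>
#include <cstddef>

// Read the idx-th 2-bit value from a packed array (4 values per byte).
inline uint8_t get2(const uint8_t *packed, size_t idx) {
    return (packed[idx / 4] >> (2 * (idx % 4))) & 0x3;
}

// Store a 2-bit value (0..2) at position idx in the packed array.
inline void set2(uint8_t *packed, size_t idx, uint8_t value) {
    size_t byte = idx / 4;
    unsigned shift = 2 * (unsigned)(idx % 4);
    packed[byte] = (uint8_t)((packed[byte] & ~(0x3u << shift)) | ((value & 0x3u) << shift));
}
This quarters the memory footprint compared to uint8_t and is 16x smaller than int, at the cost of a few shifts and masks per access.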
There are 3 effects which can cause your code to be slower in parallel (but I do not know which one is the most important in your case):
This code is memory bound; depending on your hardware, using more threads may not improve the speed of memory access, so the overall speed will not increase.
As already pointed out by @Daniel, the workload is very small, so the parallel overhead is large compared to the workload, and therefore the runtime will increase.
As also emphasized by @Daniel, the count array is small (it has only 27 elements). Continuous incrementing of its elements can cause false sharing, which may reduce efficiency significantly. You can avoid it by using a reduction (note that in this case you do not need the atomic operation, so delete that line):
#pragma omp parallel for reduction(+:count[:3][:3][:3])
If the speed still does not increase, this code is unfortunately not worth parallelizing on your hardware. Try to parallelize a bigger part of your program.
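For reference, here is how the loop from the question looks with the suggested reduction clause applied. This is a sketch assuming an OpenMP 4.5-capable compiler (array-section reductions); array contents are assumed to be filled in elsewhere:
#include <omp.h>

int array2d[100][10000];
int count[3][3][3];

void count_triples() {
    // Each thread accumulates into a private copy of count; the copies are
    // summed when the parallel region ends, so no atomic update is needed.
    #pragma omp parallel for reduction(+:count[:3][:3][:3])
    for (int i = 0; i < 10000; ++i) {
        int x = array2d[10][i];
        int y = array2d[40][i];
        int z = array2d[78][i];
        count[z][x][y]++;
    }
}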

Efficient stable sum of a sorted array in AVX2

Consider a sorted (ascending) array of double numbers. For numerical stability the array should be summed up as if iterating it from the beginning till the end, accumulating the sum in some variable.
How to vectorize this efficiently with AVX2?
I've looked into this method: Fastest way to do horizontal vector sum with AVX instructions, but it seems quite tricky to scale it to an array (some divide & conquer approach may be needed), while keeping the floating-point precision by ensuring that small numbers are summed up before being added to larger numbers.
Clarification 1: I think it should be ok to e.g. sum the first 4 items, then add them to the sum of the next 4 items, etc. I'm willing to trade some stability for performance. But I would prefer a method that doesn't ruin the stability completely.
Clarification 2: memory shouldn't be a bottleneck because the array is in L3 cache (but not in L1/L2 cache, because pieces of the array were populated from different threads). I wouldn't like to resort to Kahan summation because I think it's really the number of operations that matters, and Kahan summation would increase it about 4 times.
If you need precision and parallelism, use Kahan summation or another error-compensation technique to let you reorder your sum (into SIMD vector element strides with multiple accumulators).
As Twofold fast summation - Evgeny Latkin points out, if you bottleneck on memory bandwidth, an error-compensated sum isn't much slower than a max-performance sum, since the CPU has lots of computation throughput that goes unused in a simply-parallelized sum that bottlenecks on memory bandwidth.
See also (google results for kahan summation avx)
https://github.com/rreusser/summation-algorithms
https://scicomp.stackexchange.com/questions/10869/which-algorithm-is-more-accurate-for-computing-the-sum-of-a-sorted-array-of-numb
Is this way of processing the tail of an array with SSE overkill? has a sample implementation of SSE Kahan, not unrolled, and a comparison of actual error with it (no error) vs. sequential sum (bad) vs. simple SIMD sum (much less total error), showing that just vectorizing (and/or unrolling) with multiple accumulators tends to help accuracy.
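As a concrete illustration (my own sketch, not the code from the linked answer), here is a minimal AVX Kahan-style sum with one compensation term per vector lane. It assumes the length is a multiple of 4 and that the file is compiled without -ffast-math, which would otherwise optimise the compensation away:
#include <immintrin.h>
#include <cstddef>

double kahan_sum_avx(const double *a, size_t n) {    // n assumed to be a multiple of 4
    __m256d sum  = _mm256_setzero_pd();
    __m256d comp = _mm256_setzero_pd();               // running compensation, per lane
    for (size_t i = 0; i < n; i += 4) {
        __m256d x = _mm256_loadu_pd(a + i);
        __m256d y = _mm256_sub_pd(x, comp);
        __m256d t = _mm256_add_pd(sum, y);
        comp = _mm256_sub_pd(_mm256_sub_pd(t, sum), y); // (t - sum) - y recovers the rounding error
        sum = t;
    }
    // Horizontal sum of the 4 lane totals; plain adds are fine for this final step.
    __m128d lo = _mm256_castpd256_pd128(sum);
    __m128d hi = _mm256_extractf128_pd(sum, 1);
    __m128d s  = _mm_add_pd(lo, hi);
    s = _mm_add_sd(s, _mm_unpackhi_pd(s, s));
    return _mm_cvtsd_f64(s);
}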
Re: your idea: Summing groups of 4 numbers in-order will let you hide the FP-add latency, and bottleneck on scalar add throughput.
Doing horizontal sums within vectors takes a lot of shuffling, so it's a potential bottleneck. You might consider loading a0 a1 a2 a3, then shuffling to get a0+a1 x a2+a3 x, then (a0+a1) + (a2+a3). You have a Ryzen, right? The last step is just a vextractf128 down to 128b. That's still 3 total ADD uops, and 3 shuffle uops, but with fewer instructions than scalar or 128b vectors.
Your idea is very similar to Pairwise Summation
You're always going to get some rounding error, but adding numbers of similar magnitude minimizes it.
See also Simd matmul program gives different numerical results for some comments about Pairwise Summation and simple efficient SIMD.
The difference between adding 4 contiguous numbers vs. vertically adding 4 SIMD vectors is probably negligible. SIMD vectors give you small strides (of SIMD vector width) in the array. Unless the array grows extremely quickly, they're still going to have basically similar magnitudes.
You don't need to horizontal sum until the very end to still get most of the benefit. You can maintain 1 or 2 SIMD vector accumulators while you use more SIMD registers to sum short runs (of maybe 4 or 8 SIMD vectors) before adding into the main accumulators.
In fact having your total split more ways (across the SIMD vector elements) means it doesn't grow as large. So it helps with exactly the problem you're trying to avoid, and horizontal summing down to a single scalar accumulator actually makes things worse, especially for a strictly sorted array.
With out-of-order execution, you don't need very many tmp accumulators to make this work and hide the FP-add latency of accumulating into the main accumulators. You can do a couple groups of accumulating into a fresh tmp = _mm_load_ps() vector and adding that to the total, and OoO exec will overlap those executions. So you don't need a huge unroll factor for your main loop.
But it shouldn't be too small, you don't want to bottleneck on the add latency into the main accumulator, waiting for the previous add to produce a result before the next one can start. You want to bottleneck on FP-add throughput. (Or if you care about Broadwell/Haswell and you don't totally bottleneck on memory bandwidth, mix in some FMA with a 1.0 multiplier to take advantage of that throughput.)
e.g. Skylake SIMD FP add has 4 cycle latency, 0.5 cycle throughput, so you need to be doing at least 7 adds that are part of a short dep chain for every add into a single accumulator. Preferably more, and/or preferably with 2 long-term accumulators to better absorb bubbles in scheduling from resource conflicts.
See _mm256_fmadd_ps is slower than _mm256_mul_ps + _mm256_add_ps? for more about multiple accumulators.
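To make the multiple-accumulator idea concrete, here is a rough sketch (my own illustration, not from the linked Q&A): eight independent __m256d partial sums hide the FP-add latency, and the lanes are only combined once at the very end. For brevity the length is assumed to be a multiple of 32 doubles:
#include <immintrin.h>
#include <cstddef>

double sum_multi_acc(const double *a, size_t n) {    // n assumed to be a multiple of 32
    __m256d acc[8];
    for (int k = 0; k < 8; ++k) acc[k] = _mm256_setzero_pd();
    for (size_t i = 0; i < n; i += 32) {
        // 8 independent dependency chains keep the FP adders busy.
        for (int k = 0; k < 8; ++k)
            acc[k] = _mm256_add_pd(acc[k], _mm256_loadu_pd(a + i + 4 * k));
    }
    // Combine the 8 accumulators, then do one horizontal sum at the end.
    for (int k = 1; k < 8; ++k) acc[0] = _mm256_add_pd(acc[0], acc[k]);
    __m128d s = _mm_add_pd(_mm256_castpd256_pd128(acc[0]),
                           _mm256_extractf128_pd(acc[0], 1));
    s = _mm_add_sd(s, _mm_unpackhi_pd(s, s));
    return _mm_cvtsd_f64(s);
}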
Here's my solution so far:
#include <immintrin.h>

// Pairs adjacent elements first, then sums the pairwise results.
double SumVects(const __m256d* pv, size_t n) {
    if (n == 0) return 0.0;
    __m256d sum = pv[0];
    if (n == 1) {
        sum = _mm256_permute4x64_pd(sum, _MM_SHUFFLE(3, 1, 2, 0));
    } else {
        for (size_t i = 1; i + 1 < n; i++) {
            sum = _mm256_hadd_pd(sum, pv[i]);
            sum = _mm256_permute4x64_pd(sum, _MM_SHUFFLE(3, 1, 2, 0));
        }
        sum = _mm256_hadd_pd(sum, pv[n - 1]);
    }
    const __m128d laneSums = _mm_hadd_pd(_mm256_extractf128_pd(sum, 1),
                                         _mm256_castpd256_pd128(sum));
    return laneSums.m128d_f64[0] + laneSums.m128d_f64[1]; // m128d_f64 is MSVC-specific
}
Some explanation: it adds adjacent double array items first, such as a[0]+a[1], a[2]+a[3], etc. Then it adds the sums of adjacent items.
The games you want to play are likely counterproductive. Try experimenting by generating a bunch of iid samples from your favourite distribution, sorting them, and comparing "sum in increasing order" with "sum each lane in increasing order, then sum the lane sums."
For 4 lanes and 16 data, summing lanewise gives me smaller error about 28% of the time while summing in increasing order gives me smaller error about 17% of the time; for 4 lanes and 256 data, summing lanewise gives me smaller error about 68% of the time while summing in increasing order gives me smaller error about 12% of the time. Summing lanewise also beats the algorithm you gave in your self-answer. I used a uniform distribution on [0,1] for this.
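A rough sketch of that experiment (my own code, not the commenter's): it sorts uniform samples, then compares plain in-order summation with 4-lane strided summation against a long double reference and counts which one has the smaller error:
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937_64 rng(42);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    const int trials = 10000, n = 256, lanes = 4;
    int seq_wins = 0, lane_wins = 0;
    for (int t = 0; t < trials; ++t) {
        std::vector<double> a(n);
        for (double &x : a) x = dist(rng);
        std::sort(a.begin(), a.end());              // ascending, as in the question

        long double ref = 0.0L;                     // higher-precision reference
        for (double x : a) ref += x;

        double seq = 0.0;                           // sum in increasing order
        for (double x : a) seq += x;

        double lane[lanes] = {0.0, 0.0, 0.0, 0.0};  // sum each lane in increasing order
        for (int i = 0; i < n; ++i) lane[i % lanes] += a[i];
        double lanewise = (lane[0] + lane[1]) + (lane[2] + lane[3]);

        long double eseq  = std::fabs((long double)seq - ref);
        long double elane = std::fabs((long double)lanewise - ref);
        if (eseq < elane) ++seq_wins;
        else if (elane < eseq) ++lane_wins;
    }
    std::printf("sequential smaller error: %d, lanewise smaller error: %d (of %d)\n",
                seq_wins, lane_wins, trials);
    return 0;
}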

How to reduce the overhead of loop when measuring the performance?

When I try to measure the performance of a piece of code, I put it into a loop and iterate a million times.
for i: 1 -> 1000000
{
"test code"
}
But by using profiling tools, I found that the overhead of the loop is so big that it impacts the performance result significantly, especially when the piece of code is small, say, 1.5s of total elapsed time with 0.5s of loop overhead.
So I'd like to know if there is a better way to test the performance. Or should I stick to this method, but put multiple copies of the same code inside the loop to increase its weight in the measurement?
for i: 1 -> 1000000
{
"test code copy 1"
"test code copy 2"
"test code copy 3"
"test code copy 4"
}
Or is it OK to subtract the loop overhead from the total time? Thanks a lot!
You will need to look at the assembly listing generated by the compiler. Count the number of instructions in the overhead.
Usually, for an incrementing loop, the overhead consists of:
Incrementing loop counter.
Branching to the top of the loop.
Comparison of counter to limit.
On many processors, these are one processor instruction each, or close to that. So find out the average time for an instruction to execute, multiply by the number of instructions in the overhead, and that becomes your overhead time for one iteration.
For example, on a processor that averages 100ns per instruction and 3 instructions for the overhead, each iteration uses 3 * (100ns) or 300ns per iteration. Given 1.0E6 iterations, 3.0E08 nanoseconds will be due to overhead. Subtract this quantity from your measurements for a more accurate measurement of the loop's content.
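As an alternative to counting instructions, a common approach is to time an empty loop and subtract it from the timed run. A rough C++ sketch of that idea (my own illustration; test_code is a placeholder for whatever is being measured):
#include <chrono>
#include <cstdio>

volatile long sink;                        // keeps the compiler from deleting the work

static void test_code() { sink += 1; }     // placeholder for the code under test

int main() {
    using clock = std::chrono::steady_clock;
    const long iters = 1000000;

    // Empty loop: a volatile counter stops the optimiser from removing it entirely,
    // though that also slightly changes what "overhead" means here.
    auto t0 = clock::now();
    for (volatile long i = 0; i < iters; i = i + 1) { }
    auto t1 = clock::now();

    for (long i = 0; i < iters; ++i) test_code();
    auto t2 = clock::now();

    double overhead = std::chrono::duration<double>(t1 - t0).count();
    double total    = std::chrono::duration<double>(t2 - t1).count();
    std::printf("overhead %.6f s, total %.6f s, work-only estimate %.6f s\n",
                overhead, total, total - overhead);
    return 0;
}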

Parallelization of elementwise matrix multiplication

I'm currently optimizing parts of my code and therefore perform some benchmarking.
I have NxN matrices A and T and want to multiply them elementwise and save the result in A again, i.e. A = A*T. As this assignment cannot be parallelized directly with OpenMP, I expanded it into
!$OMP PARALLEL DO
do j = 1, N
do i = 1, N
A(i,j) = T(i,j) * A(i,j)
end do
end do
!$OMP END PARALLEL DO
(Full minimal working example at http://pastebin.com/RGpwp2KZ.)
The strange thing now is that, regardless of the number of threads (between 1 and 4), the execution time stays more or less the same (±10%), but the CPU time increases with a greater number of threads. That made me think that all the threads do the same work (because I made a mistake regarding OpenMP) and therefore need the same time.
But on another computer (that has 96 CPU cores available) the program behaves as expected: With increasing thread number the execution time decreases. Surprisingly the CPU time decreases as well (up to ~10 threads, then rising again).
It might be that there are different versions of OpenMP or gfortran installed. If this could be the cause it'd be great if you could tell me how to find that out.
You could in theory make Fortran array operations parallel by using the Fortran-specific OpenMP WORKSHARE directive:
!$OMP PARALLEL WORKSHARE
A(:,:) = T(:,:) * A(:,:)
!$OMP END PARALLEL WORKSHARE
Note that although this is quite standard OpenMP code, some compilers, most notably the Intel Fortran Compiler (ifort), implement the WORKSHARE construct simply by means of the SINGLE construct, therefore giving no parallel speed-up whatsoever. On the other hand, gfortran converts this code fragment into an implicit PARALLEL DO loop. Note that gfortran won't parallelise the standard array notation A = T * A inside the worksharing construct unless it is written explicitly as A(:,:) = T(:,:) * A(:,:).
Now about the performance and the lack of speed-up. Each column of your A and T matrices occupies (2 * 8) * 729 = 11664 bytes. One matrix occupies 8.1 MiB and the two matrices together occupy 16.2 MiB. This probably exceeds the last-level cache size of your CPU.
The multiplication code also has very low compute intensity: it fetches 32 bytes of memory data per iteration and performs one complex multiplication in 6 FLOPs (4 real multiplications, 1 addition and 1 subtraction), then stores 16 bytes back to memory, which results in (6 FLOP)/(48 bytes) = 1/8 FLOP/byte. If the memory is considered to be full duplex, i.e. it supports writing while being read, then the intensity goes up to (6 FLOP)/(32 bytes) = 3/16 FLOP/byte.
It follows that the problem is memory bound and even a single CPU core might be able to saturate all the available memory bandwidth. For example, a typical x86 core can retire two floating-point operations per cycle and if run at 2 GHz it could deliver 4 GFLOP/s of scalar math. To keep such a core busy running your multiplication loop, the main memory should provide (4 GFLOP/s) * (16/3 byte/FLOP) = 21.3 GB/s. This more or less exceeds the real memory bandwidth of current-generation x86 CPUs. And this is only for a single core with non-vectorised code.
Adding more cores and threads will therefore not increase the performance, since the memory simply cannot deliver data fast enough to keep the cores busy. Rather, the performance will suffer since having more threads adds more overhead. When run on a multisocket system like the one with 96 cores, the program gets access to more last-level cache and to higher main-memory bandwidth (assuming a NUMA system with a separate memory controller in each CPU socket), so the performance increases, but only because there are more sockets and not because there are more cores.
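The bandwidth requirement above, written as a single formula (same numbers, just rearranged):
\[
\text{required bandwidth}
  = \underbrace{4\ \text{GFLOP/s}}_{\text{core throughput}}
    \times
    \underbrace{\frac{32\ \text{bytes}}{6\ \text{FLOP}}}_{\text{bytes moved per FLOP}}
  \approx 21.3\ \text{GB/s}.
\]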

Amortized analysis of std::vector insertion

How do we do the analysis of insertion at the back (push_back) in a std::vector? Its amortized time is O(1) per insertion. In particular, in a video on Channel 9 by Stephan T. Lavavej, and in this one (17:42 onwards), he says that for optimal performance Microsoft's implementation of this method increases the capacity of the vector by a factor of around 1.5.
How is this constant determined?
Assuming you mean push_back and not insertion, I believe that the important part is multiplying by some constant (as opposed to grabbing N more elements each time), and as long as you do this you'll get amortized constant time. Changing the factor changes the average-case and worst-case performance.
Concretely:
If your constant factor is too large, you'll have good average-case performance but bad worst-case performance, especially as the arrays get big. For instance, imagine doubling (2x) a 10000-element vector just because the 10001st element is pushed. EDIT: As Michael Burr indirectly pointed out, the real cost here is probably that you'll grow your memory much larger than you need it to be. I would add that there are cache issues that affect speed if your factor is too large. Suffice it to say that there are real costs (memory and computation) if you grow much larger than you need.
However, if your constant factor is too small, say 1.1x, then you're going to have good worst-case performance but bad average performance, because you're going to incur the cost of reallocating too many times.
Also, see Jon Skeet's answer to a similar question previously. (Thanks @Bo Persson.)
A little more about the analysis: say you have n items you are pushing back and a multiplication factor of M. Then the number of reallocations will be roughly log base M of n (log_M(n)), and the ith reallocation will cost proportional to M^i (M to the ith power). Then the total time of all the pushbacks will be M^1 + M^2 + ... + M^(log_M(n)). The number of pushbacks is n, and thus you get this series (which is a geometric series, and reduces to roughly (nM)/(M-1) in the limit) divided by n. This is roughly a constant, M/(M-1).
For large values of M you will overshoot a lot and allocate much more than you need reasonably often (which I mentioned above). For small values of M (close to 1) this constant M/(M-1) becomes large. This factor directly affects the average time.
You can do the math to try to figure how this kind of thing works.
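One way to "do the math" empirically is a small simulation (my own sketch, not from the answer): it replays push_backs with a given growth factor, counts element copies, and tracks the worst-case copies-per-push ratio, which the analysis says is roughly M/(M-1):
#include <algorithm>
#include <cstdio>

// Simulate n push_backs with growth factor m; count element copies at each
// reallocation and record the worst ratio of total copies to pushes so far.
static double worst_copies_per_push(double m, long n) {
    long capacity = 1, size = 0, copies = 0;
    double worst = 0.0;
    for (long i = 1; i <= n; ++i) {
        if (size == capacity) {
            copies += size;                           // reallocation copies every element
            long grown = (long)(capacity * m);
            capacity = grown > capacity ? grown : capacity + 1;
            worst = std::max(worst, (double)copies / (double)i);
        }
        ++size;
    }
    return worst;
}

int main() {
    for (double m : {1.5, 2.0, 3.0}) {
        std::printf("factor %.1f: worst copies/push %.3f (M/(M-1) = %.3f)\n",
                    m, worst_copies_per_push(m, 10000000), m / (m - 1.0));
    }
    return 0;
}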
A popular method for this kind of amortized analysis is the banker's method. What you do is mark up all your operations with an extra cost, "saving" it up to pay for an expensive operation later on.
Let's make some dumb assumptions to simplify the math:
Writing into an array costs 1. (Same for inserting and moving between arrays)
Allocating a larger array is free.
And our algorithm looks like:
function insert(x) {
    if n_elements >= maximum array size:
        move all elements to a new array that
        is K times larger than the current size
    add x to array
    n_elements += 1
}
Obviously, the "worst case" happens when we have to move the elements to the new array. Let's try to amortize this by adding a constant markup of d to the insertion cost, bringing it to a total of (1 + d) per operation.
Just after an array has been resized, we have (1/K) of it filled up and no money saved.
By the time we fill the array up, we can be sure to have at least d * (1 - 1/K) * N saved up. Since this money must be able to pay for all N elements being moved, we can figure out a relation between K and d:
d*(1 - 1/K)*N = N
d*(K-1)/K = 1
d = K/(K-1)
A helpful table:
K      d      1+d (total insertion cost)
1.0    inf    inf
1.1    11.0   12.0
1.5    3.0    4.0
2.0    2.0    3.0
3.0    1.5    2.5
4.0    1.3    2.3
inf    1.0    2.0
So from this you can get a rough mathematician's idea of how the time/memory tradeoff works for this problem. There are some caveats, of course: I didn't go over shrinking the array when it has fewer elements; this only covers the worst case, where no elements are ever removed; and the time cost of allocating extra memory isn't accounted for.
They most likely ran a bunch of experimental tests to figure this out in the end, making most of what I wrote irrelevant, though.
Uhm, the analysis is really simple when you're familiar with number systems, such as our usual decimal one.
For simplicity, then, assume that each time the current capacity is reached, a new 10x as large buffer is allocated.
If the original buffer has size 1, then the first reallocation copies 1 element, the second (where now the buffer has size 10) copies 10 elements, and so on. So with five reallocations, say, you have 1+10+100+1000+10000 = 11111 element copies performed. Multiply that by 9, and you get 99999; now add 1 and you have 100000 = 10^5. Or in other words, doing that backwards, the number of element copies performed to support those 5 reallocations has been (10^5-1)/9.
And the buffer size after 5 reallocations, 5 multiplications by 10, is 10^5. Which is roughly a factor of 9 larger than the number of element copy operations. Which means that the time spent on copying is roughly linear in the resulting buffer size.
With base 2 instead of 10 you get (2^5-1)/1 = 2^5-1.
And so on for other bases (or factors to increase buffer size by).
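In general, with growth factor B and k reallocations, the same argument gives (my summary of the reasoning above):
\[
\text{total copies} = 1 + B + B^2 + \dots + B^{k-1} = \frac{B^k - 1}{B - 1},
\qquad
\text{final buffer size} = B^k,
\]
so the copying work is smaller than the final buffer size by roughly a factor of B - 1, i.e. it stays linear in the number of elements.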
Cheers & hth.