Related
sorry for a seemingly stupid question. I was testing the computational efficiency when replacing for-loop operations on matrices with intrinsic functions. When I check the matrices product results of the two methods, it confused me that the two outputs were not the same. Here is the simplified code I used
program matmultest
integer,parameter::nx=64,ny=32,nz=16
real*8::mat1(nx,ny),mat2(ny,nz)
real*8::result1(nx,nz),result2(nx,nz),diff(nx,nz)
real*8::localsum
integer::i,j,m
do i=1,ny
do j=1,nx
mat1(j,i)=dble(j)/7d0+2.65d0*dble(i)
enddo
enddo
do i=1,nz
do j=1,ny
mat2(j,i)=5d0*dble(j)-dble(i)*0.45d0
enddo
enddo
do j=1,nz
do i=1,nx
localsum=0d0
do m=1,ny
localsum=localsum+mat1(i,m)*mat2(m,j)
enddo
result1(i,j)=localsum
enddo
enddo
result2=matmul(mat1,mat2)
diff=result2-result1
print*,sum(abs(diff)),maxval(diff)
end program matmultest
And the result gives
1.6705598682165146E-008 5.8207660913467407E-011
The difference is non-zero for real8 but zero when I tested for integer later. I wonder if it is because of my code's faults somewhere or the numerical accuracy of MATMUL() is single precision?
And the compiler I am using is GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Thanks!
francescalus explained that reordering of operations causes these differences. Let's try to find out how it actually happened.
A few words about matrix product
Consider matrices A(n,p), B(p,q), C(n,q) and C = A*B.
The naive approach, a variant of which you used, involves the following nested loops:
c = 0
do i = 1, n
do j = 1, p
do k = 1, q
c(i, j) = c(i, j) + a(i, k) * b(k, j)
end do
end do
end do
These loops can be executed in any of 6 orders, depending on the variable that you choose at each level. In the example above, the loop is named "ijk", and the other variants "ikj", "jik", etc. are all correct.
There is a speed difference, due to the memory cache: when the inner loop runs across contiguous memory elements, the loop is faster. That's the jki or kji cases.
Indeed, since Fortran matrices are stored in column major order, if the innermost loop runs on i, in the instruction c(i, j) = c(i, j) + a(i, k) * c(k, j), the value c(k, j) is constant, and the operation is equivalent to v(i) = v(i) + x * u(i), where the elements of vectors v and u are contiguous.
However, regarding the order of operations, there shouldn't be a difference: you can check for yourself that all elements of C are computed in the same order. At least at the "higher level": the compiler might optimize things differently, and it's where it becomes really interesting.
What about MATMUL? I believe it's usually a naive matrix product, based on the nested loops above, say a jki loop.
There are other ways to multiply matrices, that involve the Strassen algorithm to improve the algorithm complexity or blocking (i.e. computed products of submatrices) to improve cache use. Other methods that could change the result are OpenMP (i.e. multithread), or using FMA instructions. But here we are not going to delve into these methods. It's really only about the nested loops. If you are interested, there are many resources online, check this.
A few words about optimization
Three remarks first:
On a processor without SIMD instructions, you would get the same result as MATMUL (i.e. you would print zero in the end).
If you had implemented the loops as above, you would also get the same result. There is a tiny but significant difference in your code.
If you had implemented the loops as a subroutine, you would also get the same result. Here I suspect the compiler optimizer is doing some reordering, as I can't reproduce your "accumulator" code with a subroutine, at least with Intel Fortran.
Here is your implementation:
do i = 1, n
do j = 1, p
s = 0
do k = 1, q
s = s + a(i, k) * b(k, j)
end do
c(i, j) = s
end do
end do
It's also correct of course. Here, you are using an accumulator, and at the end of the innermost loop, the value of the accumulator is written in the matrix C.
Optimization is typically relevant on the innermost loop mainly. For our purpose, two "basic" instructions in the innermost loop are relevant, if we get rid of all other details:
v(i) = v(i) + x*u(i) (the jki loop)
s = s + x(k)*y(k) (the accumulator loop where y is contiguous in memory, but not x)
The first is usually called a "daxpy" (from the name of a BLAS routine), for "A X Plus Y", the "D" meaning double precision. The second one is just an accumulator.
On an old sequential processor, there is not much to be done to optimize. On a modern processor with SIMD, registers can hold several values, and computations can be done on all of them at once, in parallel. For instance, on x86, an XMM register (from SSE instruction set) can hold two double precision floating-point numbers. A YMM register (from AVX2) can hold four numbers, and a ZMM register (AVX512, found on Xeon) can hold eight numbers.
For instance, on YMM the innermost loop will be "unrolled" to deal with four vector elements at a time (or even more if using several registers).
Here is what the basic loop block is then roughly doing:
daxpy case:
Read 4 numbers from u into register YMM1
Read 4 numbers from v into register YMM2
x is constant and is kept in another register
Multiply in parallel x with YMM1, add in parallel to YMM2, put the result in YMM2
Write back the result to corresponding elements of v
The read/write part is faster if the elements are contiguous in memory, but if they are not it's still worth doing this in parallel.
Note that here, we haven't changed the execution order of additions of the high level Fortran loop.
accumulator case
For the parallelism to be useful, there will be a trick: accumulate four values in parallel in a YMM register, and then add the four accumulated values.
The basic loop block is thus doing this:
The accumulator is kept in YMM3 (four numbers)
Read 4 numbers from X into register YMM1
Read 4 numbers from Y into register YMM2
Multiply in parallel YMM1 with YMM2, add in parallel to YMM3
At the end of the innermost loop, add the four components of the accumulator, and write this back as the matrix element.
It's like if we had computed:
s1 = x(1)*y(1) + x(5)*y(5) + ... + x(29)*y(29)
s2 = x(2)*y(2) + x(6)*y(6) + ... + x(30)*y(30)
s3 = x(3)*y(3) + x(7)*y(7) + ... + x(31)*y(31)
s4 = x(4)*y(4) + x(8)*y(8) + ... + x(32)*y(32)
And then the matrix element written is c(i,j) = s1+s2+s3+s4.
Here the order of additions has changed! And then, since the order is different, the result is very likely different.
I can replicate the results when using fast math (I have Intel Fortran), and when I compile with the default /fp:fast I get the following max error and speed
! Error Loops Matmul
! 0.58208E-10 107526.9 140056.0 FAST
The error is just maxval(abs(diff)) speed measured is in # of matrix operations per second.
But when I compile with /fp:strict then I get no error, but a slowdown with the loops
! Error Loops Matmul
! 0.0000 43140.6 141844.0 STRICT
I see a -60% slowdown in the loops with strict floating-point handling, but surprisingly no slowdown with the matmul() function.
Source Code for completeness
program Console1
use iso_fortran_env
implicit none
integer,parameter :: nr = 100000
integer,parameter::nx=64,ny=32,nz=16
real(real64)::mat1(nx,ny),mat2(ny,nz)
real(real64)::result1(nx,nz),result2(nx,nz),diff(nx,nz)
real(real64)::localsum
integer::i,j,r
integer(int64) :: tic, toc, rate
real(real64) :: dt1, dt2
do i=1,ny
do j=1,nx
mat1(j,i)=dble(j)/7d0+2.65d0*dble(i)
enddo
enddo
do i=1,nz
do j=1,ny
mat2(j,i)=5d0*dble(j)-dble(i)*0.45d0
enddo
enddo
call SYSTEM_CLOCK(tic,rate)
do r=1, nr
result1=mymatmul(mat1,mat2)
end do
call SYSTEM_CLOCK(toc,rate)
dt1 = dble(toc-tic)/rate
call SYSTEM_CLOCK(tic,rate)
do r=1, nr
result2=matmul(mat1,mat2)
end do
call SYSTEM_CLOCK(toc,rate)
dt2 = dble(toc-tic)/rate
diff=result2-result1
print ('(1x,a16,1x,a16,1x,a16)'), "Error", "Loops", "Matmul"
print ('(1x,g16.5,1x,f16.1,1x,f16.1)'), maxval(abs(diff)), nr/dt1, nr/dt2
! Error Loops Matmul
! 0.58208E-10 107526.9 140056.0 FAST
! 0.0000 43140.6 141844.0 STRICT
!
contains
pure function mymatmul(a,b) result(c)
real(real64), intent(in) :: a(:,:), b(:,:)
real(real64) :: c(size(a,1), size(b,2))
integer :: i,j,k
real(real64) :: sum
do j=1, size(c,2)
do i=1, size(c,1)
sum = 0d0
do k=1, size(a,2)
sum = sum + a(i,k)*b(k,j)
end do
c(i,j) = sum
end do
end do
end function
end program Console1
Always compiled as Release-x64 and not Debug.
I'm updating a few elements in a large array.
The updates consists of:
multiplying the current value by ten (if it's not zero)
clearing the current value
moving the updated value to a new position in the array
I know there will be no collision when a move occurs.
How can I tell the compiler that it can safely parallelized the loop?
do i = 1, 1e6
if ( v[i] /= 0 ) then
temp = v[i] * 10
v[i] = 0
ndx = get_move_to_ndx(i)
v[ndx] = temp
end if
end do
I'm on ifort, but I guess this is compiler independent.
Here is a mongrel approach, so you have some ideas using temporary vectors. The WHERE may not be correct, you would have to try it. The Main advantage of WHERE/ELSEWHERE is readability, as it is usually not as fast as loops... Just easier to read.
!DIR$ SIMD
FillTemp: Do I = 1, 1000000
Temps(I) = v(I)*10
ENDDO FillTemp
!$OMP PARALLEL DO
FindIndex: Do I = 1, 1000000
ndx_vect(I) = get_move_to_ndx(i)
ENDDO FindIndex
WHERE( Temps /= 0 )
V = 0
ELSEWHERE
v(ndx_Vect) = tempz
ENDWHERE
I have a relatively simple loop where I'm calculating the net acceleration of a system of particles using a brute-force method.
I have a working OpenMP loop which loops over each particles and compares it to every other particles for an n^2 complexity here:
!$omp parallel do private(i) shared(bodyArray, n) default(none)
do i = 1, n
!acc is real(real64), dimension(3)
bodyArray(i)%acc = bodyArray(i)%calcNetAcc(i, bodyArray)
end do
which works just fine.
What I'm trying to do now is to reduce my calculation time by only computing the force on each body once using the fact that the force from F(a->b) = -F(b->a), reducing the number of interactions to calculate by half (n^2 / 2). Which I do in this loop:
call clearAcceleration(bodyArray) !zero out acceleration
!$omp parallel do private(i, j) shared(bodyArray, n) default(none)
do i = 1, n
do j = i, n
if ( i /= j .and. j > i) then
bodyArray(i)%acc = bodyArray(i)%acc + bodyArray(i)%accTo(bodyArray(j))
bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
end if
end do
end do
But I'm having a lot of difficulty with this parallelizing this loop, I keep getting junk results. I think it has to do with this line:
bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
and that the forces are not being added up properly with all the different 'j' writing to it.
I've tried using the atomic statement, but that's not allowed on array variables. So then I tried critical, but that increases the time it takes by about 20, and still doesn't give correct results. I also tried adding an ordered statement, but then I just get NaN for all my results.
Is there an easy fix to get this loop working with OpenMP?
Working code, it has a slight speed improvement but not the ~2x I was looking for.
!$omp parallel do private(i, j) shared(bodyArray, forces, n) default(none) schedule(guided)
do i = 1, n
do j = 1, i-1
forces(j, i)%vec = bodyArray(i)%accTo(bodyArray(j))
forces(i, j)%vec = -forces(j, i)%vec
end do
end do
!$omp parallel do private(i, j) shared(bodyArray, n, forces) schedule(static)
do i = 1, n
do j = 1, n
bodyArray(i)%acc = bodyArray(i)%acc + forces(j, i)%vec
end do
end do
With your current approach and data structures you're going to struggle to get good speedup with OpenMP. Consider the loop nest
!$omp parallel do private(i, j) shared(bodyArray, n) default(none)
do i = 1, n
do j = i, n
if ( i /= j .and. j > i) then
bodyArray(i)%acc = bodyArray(i)%acc + bodyArray(i)%accTo(bodyArray(j))
bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
end if
end do
end do
[Actually, before you consider it, revise it as follows ...
!$omp parallel do private(i, j) shared(bodyArray, n) default(none)
do i = 1, n
do j = i+1, n
bodyArray(i)%acc = bodyArray(i)%acc + bodyArray(i)%accTo(bodyArray(j))
bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
end do
end do
..., now back to the issues]
There are two problems here:
As you've already twigged, you've got a data race updating bodyArray(j)%acc; multiple threads will try to update the same element and there is no coordination of those updates. Junk results. Using critical sections or ordering the statements serialises the code; when you get it right you also get it as slow as it was before you started with OpenMP.
The pattern of access to elements of bodyArray is cache-unfriendly. It wouldn't surprise me to find that, even if you address the data race without serialising the computation, the impact of the cache-unfriendliness is to produce code slower than the original. Modern CPUs are crazy-fast in computation but the memory systems struggle to feed the beasts so cache effects can be massive. Trying to run two loops over the same rank-1 array simultaneously, which is in essence what your code does, is never (?) going to shift data through cache at maximum speed.
Personally I would try the following. I'm not going to guarantee that this will be faster, but it will be easier (I think) than fixing your current approach and fit OpenMP like a glove. I do have a nagging doubt that this is overcomplicating matters, but I haven't had a better idea yet.
First, create a 2D array of reals, call it forces, where element force(i,j) is the force that element i exerts on j. Then, some code like this (untested, that's your responsibility if you care to follow this line)
forces = 0.0 ! Parallelise this if you want to
!$omp parallel do private(i, j) shared(forces, n) default(none)
do i = 1, n
do j = 1, i-1
forces(i,j) = bodyArray(i)%accTo(bodyArray(j)) ! if I understand correctly
end do
end do
then sum the forces on each particle (and get the following right, I haven't checked carefully)
!$omp parallel do private(i) shared(bodyArray,forces, n) default(none)
do i = 1, n
bodyArray(i)%acc = sum(forces(:,i))
end do
As I wrote above, computation is extremely fast and if you have the memory to spare it's often well worth trading some space for time.
Now what you have is, probably, a problem with load balancing in the loop nest over forces. Most OpenMP implementations will, by default, perform a static distribution of work (this is not required by the standard but seems to be most common, check your documentation). So thread 1 will get the first n/num_threads rows to deal with, but these are the itty-bitty little rows at the top of the triangle you're computing. Thread 2 will get more work, thread 3 still more, and so forth. You might get away with simply adding a schedule(dynamic) clause to the parallel directive, you might have to work a bit harder to balance the load.
You may also want to review my code snippets wrt cache-friendliness and adjust as appropriate. And you may well find, if you do as I suggest, that you were better off with your original code, that halving the amount of computation doesn't actually save much time.
Another approach would be to pack the lower (or upper) triangle of forces into a rank-1 array and use some fancy indexing arithmetic to transform 2D (i,j) indices into a 1D index into that array. This would save storage space, and might be easier to make cache-friendly.
I am very confused about this problem regarding openmp in fortran. Specifically, when I write the program like this:
PROGRAM TEST
IMPLICIT NONE
INTEGER :: i,j,l
INTEGER :: M(2,2)
i=2
j=2
l=41
!$OMP PARALLEL SHARED(M),PRIVATE(l,i,j)
!$OMP DO
DO i=1,2
DO j=1,2
DO l=0,41
M(i,j)=M(i,j)+1
ENDDO
ENDDO
ENDDO
!$OMP END DO
!$OMP END PARALLEL
END PROGRAM TEST
After compiling by: ifort -openmp test.f90, it works well, and the results of M(1,1) is 42 as expected.
However, when I only adjust the order of sum over l and {i,j}, like the following:
PROGRAM TEST
IMPLICIT NONE
INTEGER :: i,j,l
INTEGER :: M(2,2)
i=2
j=2
l=41
!$OMP PARALLEL SHARED(M),PRIVATE(l,i,j)
!$OMP DO
DO l=0,41
DO i=1,2
DO j=1,2
M(i,j)=M(i,j)+1
ENDDO
ENDDO
ENDDO
!$OMP END DO
!$OMP END PARALLEL
END PROGRAM TEST
After compiling by: ifort -openmp test.f90, it doesn't work well. In fact, when you run a.out several times, the results of M(1,1) seems to be random. Does anyone know what's the problem? Also, if I want to obtain the right results, under the summing order:
DO l=0,41
DO i=1,2
DO j=1,2
what part should I modify this code?
Many thanks for any help.
You have a race condition. Threads with different l are trying to use the same element M(i,j). You can use tools like Intel Inspector or Oracle Thread Analyzer to find it (I checked with Intel). The best thing to do is using your original order. You can also use reduction, but be careful with larger arrays:
PROGRAM TEST
IMPLICIT NONE
INTEGER :: i,j,l
INTEGER :: M(2,2)
M = 0
!$OMP PARALLEL DO PRIVATE(l,i,j),reduction(+:M)
DO l = 0, 41
DO i = 1, 2
DO j = 1, 2
M(i,j) = M(i,j) + 1
END DO
END DO
END DO
!$OMP END PARALLEL DO
print *, M
END PROGRAM
There are many problems with your approach. First of all, the missing initialization of your array M. Inside your loop, you issue
M(i,j) = M(i,j) + 1
without having given any initial value to M(i,j). So the algorithm is indeterministic even in the serial case, and it is just a matter of lack, that you obtain the right result with any specific compiler or any specific summation order.
Addintionally, if you parallelize the loop over l, like
!$OMP PARALLEL DO SHARED(M),PRIVATE(l,i,j)
DO l = 0, 41
DO i = 1, 2
DO j = 1, 2
M(i,j) = M(i,j) + 1
END DO
END DO
END DO
every thread will have an own nested loop construct over i and j covering all matrix elements. Consequently, different threads will access the same elements of the matrix at the same time. The result again being indeterministic. You could of course, try to solve the issue by ensuring via OpenMP constructs, that the threads wait on each other before accessing a certain matrix element. However, that would make the algorithm definitely too slow. The best you can do in this case, in my oppinion, to parallelize over the matrix elements (the loops over i and j).
By the way, the lines
i=2
j=2
l=41
in your code are superfluous, since you immediately use them as loop variables so that their will be overwritten anyway.
What is a fast way to merge sorted subsets of an array of up to 4096 32-bit floating point numbers on a modern (SSE2+) x86 processor?
Please assume the following:
The size of the entire set is at maximum 4096 items
The size of the subsets is open to discussion, but let us assume between 16-256 initially
All data used through the merge should preferably fit into L1
The L1 data cache size is 32K. 16K has already been used for the data itself, so you have 16K to play with
All data is already in L1 (with as high degree of confidence as possible) - it has just been operated on by a sort
All data is 16-byte aligned
We want to try to minimize branching (for obvious reasons)
Main criteria of feasibility: faster than an in-L1 LSD radix sort.
I'd be very interested to see if someone knows of a reasonable way to do this given the above parameters! :)
Here's a very naive way to do it. (Please excuse any 4am delirium-induced pseudo-code bugs ;)
//4x sorted subsets
data[4][4] = {
{3, 4, 5, INF},
{2, 7, 8, INF},
{1, 4, 4, INF},
{5, 8, 9, INF}
}
data_offset[4] = {0, 0, 0, 0}
n = 4*3
for(i=0, i<n, i++):
sub = 0
sub = 1 * (data[sub][data_offset[sub]] > data[1][data_offset[1]])
sub = 2 * (data[sub][data_offset[sub]] > data[2][data_offset[2]])
sub = 3 * (data[sub][data_offset[sub]] > data[3][data_offset[3]])
out[i] = data[sub][data_offset[sub]]
data_offset[sub]++
Edit:
With AVX2 and its gather support, we could compare up to 8 subsets at once.
Edit 2:
Depending on type casting, it might be possible to shave off 3 extra clock cycles per iteration on a Nehalem (mul: 5, shift+sub: 4)
//Assuming 'sub' is uint32_t
sub = ... << ((data[sub][data_offset[sub]] > data[...][data_offset[...]]) - 1)
Edit 3:
It may be possible to exploit out-of-order execution to some degree, especially as K gets larger, by using two or more max values:
max1 = 0
max2 = 1
max1 = 2 * (data[max1][data_offset[max1]] > data[2][data_offset[2]])
max2 = 3 * (data[max2][data_offset[max2]] > data[3][data_offset[3]])
...
max1 = 6 * (data[max1][data_offset[max1]] > data[6][data_offset[6]])
max2 = 7 * (data[max2][data_offset[max2]] > data[7][data_offset[7]])
q = data[max1][data_offset[max1]] < data[max2][data_offset[max2]]
sub = max1*q + ((~max2)&1)*q
Edit 4:
Depending on compiler intelligence, we can remove multiplications altogether using the ternary operator:
sub = (data[sub][data_offset[sub]] > data[x][data_offset[x]]) ? x : sub
Edit 5:
In order to avoid costly floating point comparisons, we could simply reinterpret_cast<uint32_t*>() the data, as this would result in an integer compare.
Another possibility is to utilize SSE registers as these are not typed, and explicitly use integer comparison instructions.
This works due to the operators < > == yielding the same results when interpreting a float on the binary level.
Edit 6:
If we unroll our loop sufficiently to match the number of values to the number of SSE registers, we could stage the data that is being compared.
At the end of an iteration we would then re-transfer the register which contained the selected maximum/minimum value, and shift it.
Although this requires reworking the indexing slightly, it may prove more efficient than littering the loop with LEA's.
This is more of a research topic, but I did find this paper which discusses minimizing branch mispredictions using d-way merge sort.
SIMD sorting algorithms have already been studied in detail. The paper Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture describes an efficient algorithm for doing what you describe (and much more).
The core idea is that you can reduce merging two arbitrarily long lists to merging blocks of k consecutive values (where k can range from 4 to 16): the first block is z[0] = merge(x[0], y[0]).lo. To obtain the second block, we know that the leftover merge(x[0], y[0]).hi contains nx elements from x and ny elements from y, with nx+ny == k. But z[1] cannot contain elements from both x[1] and y[1], because that would require z[1] to contain more than nx+ny elements: so we just have to find out which of x[1] and y[1] needs to be added. The one with the lower first element will necessarily appear first in z, so this is simply done by comparing their first element. And we just repeat that until there is no more data to merge.
Pseudo-code, assuming the arrays end with a +inf value:
a := *x++
b := *y++
while not finished:
lo,hi := merge(a,b)
*z++ := lo
a := hi
if *x[0] <= *y[0]:
b := *x++
else:
b := *y++
(note how similar this is to the usual scalar implementation of merging)
The conditional jump is of course not necessary in an actual implementation: for example, you could conditionally swap x and y with an xor trick, and then read unconditionally *x++.
merge itself can be implemented with a bitonic sort. But if k is low, there will be a lot of inter-instruction dependencies resulting in high latency. Depending on the number of arrays you have to merge, you can then choose k high enough so that the latency of merge is masked, or if this is possible interleave several two-way merges. See the paper for more details.
Edit: Below is a diagram when k = 4. All asymptotics assume that k is fixed.
The big gray box is merging two arrays of size n = m * k (in the picture, m = 3).
We operate on blocks of size k.
The "whole-block merge" box merges the two arrays block-by-block by comparing their first elements. This is a linear time operation, and it doesn't consume memory because we stream the data to the rest of the block. The performance doesn't really matter because the latency is going to be limited by the latency of the "merge4" blocks.
Each "merge4" box merges two blocks, outputs the lower k elements, and feeds the upper k elements to the next "merge4". Each "merge4" box performs a bounded number of operations, and the number of "merge4" is linear in n.
So the time cost of merging is linear in n. And because "merge4" has a lower latency than performing 8 serial non-SIMD comparisons, there will be a large speedup compared to non-SIMD merging.
Finally, to extend our 2-way merge to merge many arrays, we arrange the big gray boxes in classical divide-and-conquer fashion. Each level has complexity linear in the number of elements, so the total complexity is O(n log (n / n0)) with n0 the initial size of the sorted arrays and n is the size of the final array.
The most obvious answer that comes to mind is a standard N-way merge using a heap. That'll be O(N log k). The number of subsets is between 16 and 256, so the worst case behavior (with 256 subsets of 16 items each) would be 8N.
Cache behavior should be ... reasonable, although not perfect. The heap, where most of the action is, will probably remain in the cache throughout. The part of the output array being written to will also most likely be in the cache.
What you have is 16K of data (the array with sorted subsequences), the heap (1K, worst case), and the sorted output array (16K again), and you want it to fit into a 32K cache. Sounds like a problem, but perhaps it isn't. The data that will most likely be swapped out is the front of the output array after the insertion point has moved. Assuming that the sorted subsequences are fairly uniformly distributed, they should be accessed often enough to keep them in the cache.
You can merge int arrays (expensive) branch free.
typedef unsigned uint;
typedef uint* uint_ptr;
void merge(uint*in1_begin, uint*in1_end, uint*in2_begin, uint*in2_end, uint*out){
int_ptr in [] = {in1_begin, in2_begin};
int_ptr in_end [] = {in1_end, in2_end};
// the loop branch is cheap because it is easy predictable
while(in[0] != in_end[0] && in[1] != in_end[1]){
int i = (*in[0] - *in[1]) >> 31;
*out = *in[i];
++out;
++in[i];
}
// copy the remaining stuff ...
}
Note that (*in[0] - *in[1]) >> 31 is equivalent to *in[0] - *in[1] < 0 which is equivalent to *in[0] < *in[1]. The reason I wrote it down using the bitshift trick instead of
int i = *in[0] < *in[1];
is that not all compilers generate branch free code for the < version.
Unfortunately you are using floats instead of ints which at first seems like a showstopper because I do not see how to realabily implement *in[0] < *in[1] branch free. However, on most modern architectures you interprete the bitpatterns of positive floats (that also are no NANs, INFs or such strange things) as ints and compare them using < and you will still get the correct result. Perhaps you extend this observation to arbitrary floats.
You could do a simple merge kernel to merge K lists:
float *input[K];
float *output;
while (true) {
float min = *input[0];
int min_idx = 0;
for (int i = 1; i < K; i++) {
float v = *input[i];
if (v < min) {
min = v; // do with cmov
min_idx = i; // do with cmov
}
}
if (min == SENTINEL) break;
*output++ = min;
input[min_idx]++;
}
There's no heap, so it is pretty simple. The bad part is that it is O(NK), which can be bad if K is large (unlike the heap implementation which is O(N log K)). So then you just pick a maximum K (4 or 8 might be good, then you can unroll the inner loop), and do larger K by cascading merges (handle K=64 by doing 8-way merges of groups of lists, then an 8-way merge of the results).