I want to generate primes by sieving up to 100,000,000 but declaring a bool array of this range is crashing my program.
This is my code:
long long i,j,n;
bool prime[100000000+1];
prime[1]=prime[0]=false;
for(i=2;i<=100000000;i++){
    prime[i]=true;
}
for(i=2;i<=100000000;i++){
    if(prime[i]==false){
        continue;
    }
    for(j=i*2;j<=100000000;j+=i){
        prime[j]=false;
    }
}
How can I solve this problem?
The array prime is about 100 MB, and declaring such a big array on the stack is not allowed. Place the array at global scope, so it is allocated in static storage rather than on the stack, or alternatively allocate it on the heap with new (in C++) or malloc (in C). Don't forget to free the memory afterwards!
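For example (a minimal sketch of both options; use whichever you prefer):

const int N = 100000000;

bool prime[N + 1];                    // option 1: a global array lives in static storage, not on the stack

int main() {
    bool *prime2 = new bool[N + 1];   // option 2: allocate on the heap
    // ... run the sieve on whichever array you use ...
    delete[] prime2;                  // free the heap allocation when done
}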
Variables can be stored in three different memory areas: static memory, automatic memory, and dynamic memory. Automatic memory (non-static local variables) has a limited size; you exceeded it, and that crashed the program. One alternative is to mark your array static, which places it in static storage; another is to use dynamic memory.
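A minimal sketch of the static variant:

int main() {
    static bool prime[100000000 + 1];   // static: placed in static storage, not on the limited stack
    // ... sieve as before ...
}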
Since this is tagged C++...
Use std::vector which is simple to use and uses dynamic memory.
#include <vector>
//...
//...
long long i,j,n;
std::vector<bool> prime(100000000+1, true);
prime[1]=prime[0]=false;
for(i=2;i<=100000000;i++){
    if(prime[i]==false){
        continue;
    }
    for(j=i*2;j<=100000000;j+=i){
        prime[j]=false;
    }
}
std::vector<bool> uses a bit-packed representation, which means that the vector here will take about eight[1] times less memory than a plain bool array.
std::bitset is similar, but its size is fixed at compile time, and you have to mark it static to avoid taking space in automatic memory.
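For example, a sketch with std::bitset (the size must be a compile-time constant):

#include <bitset>

int main() {
    static std::bitset<100000001> prime;   // static, so the ~12 MB object is not on the stack
    prime.set();                           // start with every number marked as prime
    prime[0] = prime[1] = false;
    // ... sieve as before ...
}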
You haven't asked, but the Sieve of Eratosthenes is not the fastest algorithm for calculating a lot of prime numbers. It seems that the Sieve of Atkin is faster and uses less memory.
[1] When your system has 8-bit bytes.
You should not make a single monolithic sieve of that size. Instead, use a segmented Sieve of Eratosthenes to perform the sieving in successive segments. At the first segment, the smallest multiple of each sieving prime that is within the segment is calculated, then multiples of the sieving primes are marked composite in the normal way; when all the sieving primes have been used, the remaining unmarked number in the segment are prime. Then, for the next segment, the smallest multiple of each sieving prime is the multiple that ended the sieving in the prior segment, and so the sieving continues until finished.
Consider the example of sieving from 100 to 200 in segments of 20; the 5 sieving primes are 3, 5, 7, 11 and 13. In the first segment from 100 to 120, the bitarray has 10 slots, with slot 0 corresponding to 101, slot k corresponding to 100 + 2k + 1, and slot 9 corresponding to 119. The smallest multiple of 3 in the segment is 105, corresponding to slot 2; slots 2+3=5 and 5+3=8 are also multiples of 3. The smallest multiple of 5 is 105 at slot 2, and slot 2+5=7 is also a multiple of 5. The smallest multiple of 7 is 105 at slot 2, and slot 2+7=9 is also a multiple of 7. And so on.
Function primes takes arguments lo, hi and delta; lo and hi must be even, with lo < hi, and lo must be greater than the square root of hi. The segment size is twice delta. Array ps of length m contains the sieving primes less than the square root of hi, with 2 removed since even numbers are ignored, calculated by the normal Sieve of Eratosthenes. Array qs contains the offset into the sieve bitarray of the smallest multiple in the current segment of the corresponding sieving prime. After each segment, lo advances by twice delta, so the number corresponding to an index i of the sieve bitarray is lo + 2 i + 1.
function primes(lo, hi, delta)
    sieve := makeArray(0..delta-1)
    ps := tail(primes(sqrt(hi)))
    m := length(ps)
    qs := makeArray(0..m-1)
    for i from 0 to m-1
        qs[i] := (-1/2 * (lo + ps[i] + 1)) % ps[i]
    while lo < hi
        for i from 0 to delta-1
            sieve[i] := True
        for i from 0 to m-1
            for j from qs[i] to delta step ps[i]
                sieve[j] := False
            qs[i] := (qs[i] - delta) % ps[i]
        for i from 0 to delta-1
            t := lo + 2*i + 1
            if sieve[i] and t < hi
                output t
        lo := lo + 2*delta
For the sample given above, this is called as primes(100, 200, 10); qs is initially [2,2,2,10,8], corresponding to smallest multiples 105, 105, 105, 121 and 117, and is reset for the second segment to [1,2,6,0,11], corresponding to smallest multiples 123, 125, 133, 121 and 143.
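If a concrete version helps, here is one possible C++ translation of the pseudocode (a sketch only, following the primes(100, 200, 10) example above; the names simple_sieve and segmented_primes are mine):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// All primes <= n by the plain Sieve of Eratosthenes (used for the sieving primes).
std::vector<long long> simple_sieve(long long n) {
    std::vector<bool> is_prime(n + 1, true);
    std::vector<long long> ps;
    for (long long p = 2; p <= n; ++p) {
        if (!is_prime[p]) continue;
        ps.push_back(p);
        for (long long i = p * p; i <= n; i += p) is_prime[i] = false;
    }
    return ps;
}

// Segmented sieve as described above: lo and hi even, lo < hi, lo > sqrt(hi).
void segmented_primes(long long lo, long long hi, long long delta) {
    std::vector<long long> ps = simple_sieve((long long)std::sqrt((double)hi));
    ps.erase(ps.begin());                      // drop 2; even numbers are never represented
    std::vector<long long> qs(ps.size());
    for (std::size_t i = 0; i < ps.size(); ++i) {
        // offset of the smallest odd multiple of ps[i] in the first segment
        long long q = -((lo + ps[i] + 1) / 2) % ps[i];
        qs[i] = (q + ps[i]) % ps[i];
    }
    std::vector<bool> sieve(delta);
    while (lo < hi) {
        std::fill(sieve.begin(), sieve.end(), true);
        for (std::size_t i = 0; i < ps.size(); ++i) {
            long long j = qs[i];
            for (; j < delta; j += ps[i]) sieve[j] = false;
            qs[i] = j - delta;                 // offset carried into the next segment
        }
        for (long long i = 0; i < delta; ++i) {
            long long t = lo + 2 * i + 1;
            if (sieve[i] && t < hi) std::printf("%lld\n", t);
        }
        lo += 2 * delta;
    }
}

// segmented_primes(100, 200, 10) prints the primes between 100 and 200.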
The value of delta is critical; you should make delta as large as possible so long as it fits in cache memory, for speed. Use your language's library for the bitarray, so that you only take a single bit for each sieve location. If you need a simple Sieve of Eratosthenes to calculate the sieving primes, this is my favorite:
function primes(n)
    sieve := makeArray(2..n, True)
    for p from 2 to n step 1
        if sieve[p]
            output p
            for i from p * p to n step p
                sieve[i] := False
You can see more algorithms involving prime numbers at my blog.
I have been solving a problem but got stuck on one of its subproblems, which is as follows:
Given an array of N elements whose ith element is A[i], we are given Q queries of the type [L,R].
For each query, output the number of divisors of the product of the Lth through Rth elements.
More formally, for each query let's define P as P = A[L] * A[L+1] * A[L+2] * ... * A[R].
Output the number of divisors of P modulo 998244353.
Constraints: 1 <= N,Q <= 100000, 1 <= A[i] <= 1000000.
My approach:
For each index i, I have defined a map<int, int> which stores each prime divisor and its count in the product over [1, i].
I am extracting the prime divisors of a number in O(log N) using a sieve.
Then for each query (let's say {L,R}), I am iterating through the map of the Lth element and subtracting the count of each key from the map of the Rth element.
And then I am answering the query using the result:
if N = a^p * b^q * c^r ...(a,b,c being primes)
the number of divisors = (p+1)(q+1)(r+1)..
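For example, 12 = 2^2 * 3^1 has (2+1)(1+1) = 6 divisors: 1, 2, 3, 4, 6, 12.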
The time complexity of the above solution is O(ND + QD), where D = the number of distinct primes up to 1000000. In the worst case D = 78498.
Is there a more efficient solution than this?
There is a more efficient solution for this. But it is slightly complicated. Here are steps to get to the necessary data structure.
1. Define a data type prime_factor that is a struct containing a prime and a count.
2. Define a data type prime_factorization that is a vector of prime_factor, ordered by increasing prime. This can store the factorization of a number.
3. Write a function that takes a number and turns its prime factorization into a prime_factorization.
4. Write a function that takes 2 prime_factorization vectors and merges them into the factorization of the product of the two (see the sketch after this list).
5. For each number in your array, compute its prime factorization. That gets stored in an array.
6. For each pair in your array, compute the prime factorization of the product. We will only need half of them, so elements 0, 1 go into one factorization, 2, 3 into the next, and so on.
7. Repeat step 6 O(log(N)) times, so you have a vector of the factorization of each number, pairs, fours, eights, and so on. This results in approximately 2N precomputed factorization vectors. Most vectors are small, though a few can be up to O(D) in size (where D is the number of distinct primes). Most of the merges should be very, very fast.
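A minimal sketch of steps 1, 2 and 4 (one straightforward way to write them; the names are mine):

#include <cstddef>
#include <vector>

struct prime_factor { long long prime; int count; };      // step 1
using prime_factorization = std::vector<prime_factor>;    // step 2: ordered by increasing prime

// Step 4: the factorization of a product is the merge of the two factorizations, adding exponents.
prime_factorization merge(const prime_factorization& a, const prime_factorization& b) {
    prime_factorization out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].prime < b[j].prime)      out.push_back(a[i++]);
        else if (b[j].prime < a[i].prime) out.push_back(b[j++]);
        else {                            // same prime: exponents add
            out.push_back({a[i].prime, a[i].count + b[j].count});
            ++i; ++j;
        }
    }
    while (i < a.size()) out.push_back(a[i++]);
    while (j < b.size()) out.push_back(b[j++]);
    return out;
}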
And now you have all of your data prepared. It can't take more than O(log(N)) times the space that storing the prime factors of the individual numbers requires by itself. (Less than that normally, though, because repeats among the small primes get gathered together into one prime_factor.)
Any range is the union of at most O(log(N)) of these precomputed vectors. For example the range 10..25 can be broken up into 10..11, 12..15, 16..23, 24..25. Arrange these intervals from smallest to largest and merge them. Then compute your answer from the result.
An exact analysis is complicated. But I assure you that query time is bounded above by O(Q * D * log(N)) and normally is much less than that.
UPDATE:
How do you find those intervals?
The answer is that you need to identify the number divisible by the highest power of 2 in the range, and then fill out both sides from there. You find it by dividing both endpoints by 2 (rounding down) until they differ by exactly 1; then multiply the top boundary back by the power of 2 you divided by to get that mid-point.
For example if your range was 35-53 you would start by dividing by 2 to get 35-53, 17-26, 8-13, 4-6, 2-3. That was 2^4 we divided by, so our power-of-2 mid-point is 3*2^4 = 48. Our intervals at and above that midpoint are then 48-51, 52-53. Our intervals below are 40-47, 36-39, 35-35. And each of them has a length that is a power of 2 and starts at a number divisible by that power of 2.
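A sketch of one simple way to enumerate those intervals (a greedy left-to-right variant rather than the mid-point procedure above, but it produces the same kind of aligned blocks; the name decompose is mine):

#include <utility>
#include <vector>

// Split the inclusive range [lo, hi] into intervals whose length is a power of two
// and whose start is divisible by that length (at most O(log(hi - lo)) of them).
std::vector<std::pair<long long, long long>> decompose(long long lo, long long hi) {
    std::vector<std::pair<long long, long long>> out;
    while (lo <= hi) {
        long long len = 1;
        // grow the block while it stays aligned at lo and still fits inside [lo, hi]
        while (lo % (2 * len) == 0 && lo + 2 * len - 1 <= hi) len *= 2;
        out.push_back({lo, lo + len - 1});
        lo += len;
    }
    return out;
}

// decompose(35, 53) yields {35,35} {36,39} {40,47} {48,51} {52,53}
// decompose(10, 25) yields {10,11} {12,15} {16,23} {24,25}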
My teacher gave me this:
n <= 10^6;
an array of n integers a1..an (ai <= 10^9);
find all prime numbers.
He said something about the Sieve of Eratosthenes, and I read about it, and about wheel factorization too, but I still couldn't figure out how to get the program (fpc) to run in 1 second.
As far as I know that's impossible, but I still want to hear your opinion.
Also, with wheel factorization, a 2*3 wheel will treat 25 as a prime number, and I want to ask if there is a way to find the first composite number that the wheel wrongly treats as a prime.
Example: with a 2*3*5 wheel, how do I find the first composite number treated as a prime?
Please help.. and sorry for my bad English.
A proper Sieve of Eratosthenes should find the primes less than a billion in about a second; it's possible. If you show us your code, we'll be happy to help you find what is wrong.
The smallest composite not marked by a 2,3,5-wheel is 49: the smallest prime that is not one of the wheel primes is 7, and 7 * 7 = 49.
I did it now and it's finding primes up to 1000000 in a few milliseconds, without displaying all those numbers.
Declare an array a of n + 1 bools (zero-based). At the beginning the 0th and 1st elements are false, and all others are true (false means "not a prime").
The algorithm looks like this:
i = 2;
while i * i <= n
    if a[i] == true
        j = i * i;
        while j <= n
            a[j] = false;
            j = j + i;
    i = i + 1;
In the outer loop the condition is i * i <= n because you start crossing off from i * i (smaller multiples of i have already been removed by smaller primes), so i must not be bigger than the square root of n. You remove all numbers up to n which are multiples of primes.
Time complexity is O(n log log n).
If you want to display the primes, output the indexes whose values in the array are true.
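For reference, a minimal C++ version of the above (assuming n = 1000000, as in this answer):

#include <cstdio>
#include <vector>

int main() {
    const int n = 1000000;
    std::vector<bool> a(n + 1, true);   // a[i] == true means "i is still considered prime"
    a[0] = a[1] = false;
    for (long long i = 2; i * i <= n; ++i) {
        if (!a[i]) continue;
        for (long long j = i * i; j <= n; j += i)
            a[j] = false;
    }
    // display the primes: the indexes whose value is still true
    for (int i = 2; i <= n; ++i)
        if (a[i]) std::printf("%d\n", i);
}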
Factorization is useful if you want to find e.g. all semiprimes from 0 to n (products of two prime numbers). Then you find the smallest prime divisor of each number and check, for each number, whether it has a prime divisor and whether the number divided by its smallest prime divisor is itself prime. If so, it is a semiprime. My program written that way ran about 8 times faster than first finding all primes, then multiplying pairs of them and saving the results in an array.
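A rough sketch of that idea (my names; a smallest-prime-factor sieve plus the check described above):

#include <vector>

// spf[x] = smallest prime factor of x (spf[x] == x exactly when x is prime).
std::vector<int> smallest_prime_factor(int n) {
    std::vector<int> spf(n + 1, 0);
    for (int i = 2; i <= n; ++i)
        if (spf[i] == 0)                          // i is prime
            for (long long j = i; j <= n; j += i)
                if (spf[j] == 0) spf[j] = i;
    return spf;
}

// x is a semiprime iff x / spf(x) is itself prime.
bool is_semiprime(int x, const std::vector<int>& spf) {
    if (x < 4) return false;
    int q = x / spf[x];
    return q > 1 && spf[q] == q;
}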
Given a fixed array A of N integers where N<=100,000 and all elements of array are also less than or equal to 100,000. The numbers in A are not monotonically increasing or contiguous or otherwise conveniently organized.
Now I am given up to 100,000 queries of the form {V, L, R} where in each query I need to find the largest number A[i] with i in the range [L,R] that is not coprime with the given value V. (That is GCD(V,A[i]) is not equal to 1.)
If it is not possible, then report that all numbers in the given range are coprime to V.
A basic approach would be to iterate over each A[i] with i between L and R, compute its GCD with V, and take the maximum. But is there a better way when the number of queries can also be up to 100,000? In that case it's too inefficient to check every number for every query.
Example:
Let us have N=6 and the array be [1,2,3,4,5,4] and let V be 2 and range [L,R] is [2,5].
Then the answer is 4.
Explanation:
GCD(2,2)=2
GCD(2,3)=1
GCD(2,4)=2
GCD(2,5)=1
So maximum is 4 here.
Since you have a large array but only one V, it should be faster to start by factorizing V. After that your coprime test becomes simply finding the remainder modulo each unique factor of V.
Daniel Bernstein's "Factoring into coprimes in essentially linear time" (Journal of Algorithms 54:1, 1-30 (2005)) answers a similar question, and is used to identify bad (repeated-factor) RSA moduli in Nadia Heninger's "New research: There's No Need to Panic Over Factorable Keys--Just Mind Your Ps and Qs". The problem there is to find common factors between a huge set of very large numbers, without going a pair at a time.
Let's say that
V = p_1*...*p_n
where p_i is a prime number (you can restrict it to distinct primes only). Now the answer is
result = -1
for p_i:
    res = floor(R / p_i) * p_i
    if res >= L and res > result:
        result = res
So if you can factorize V fast then this will be quite efficient.
EDIT I didn't notice that the array does not have to contain all integers. In that case sieve it, i.e. given the prime numbers p_1, ..., p_n create a "reversed" sieve (i.e. all multiples of those primes in the range [L, R]). Then you can just intersect that sieve with your initial array.
EDIT2 To generate the set of all multiples you can use this algorithm:
primes = [p_1, ..., p_n]
multiples = []
for p in primes:
    lower = floor((L - 1) / p)   # L - 1 so that L itself is included when it is a multiple of p
    upper = floor(R / p)
    for i in [lower+1, upper]:
        multiples.append(i*p)
The important thing is that it follows from the math that V is coprime to every number in the range [L, R] which is not in multiples. Now you simply do:
solution = -1
for no in initial_array:
    if no in multiples:
        solution = max(solution, no)
Note that if you implement multiples as a hash set, then the if no in multiples: check is O(1) on average.
EXAMPLE Let's say that V = 6 = 2*3 and initial_array = [7,11,12,17,21] and L=10 and R=22. Let's start with multiples. Following the algorithm we obtain that
multiples = [10, 12, 14, 16, 18, 20, 22, 12, 15, 18, 21]
The first 7 are multiples of 2 (in the range [10, 22]) and the last 4 are multiples of 3 (in the range [10, 22]). Since we are dealing with sets (std::set?), there will be no duplicates (12 and 18 appear only once):
multiples = [10, 12, 14, 16, 18, 20, 22, 15, 21]
Now go through the initial_array and check what values are in multiples. We obtain that the biggest such number is 21. And indeed 21 is not coprime with 6.
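Putting the pieces together, a rough C++ sketch of this answer's approach (taking L and R as value bounds, as in the example; the names are mine):

#include <algorithm>
#include <unordered_set>
#include <vector>

// Largest value in `a` that lies in [L, R] and shares a prime factor with V, or -1 if none.
long long solve(long long V, long long L, long long R, const std::vector<long long>& a) {
    // distinct prime factors of V by trial division
    std::vector<long long> primes;
    for (long long p = 2; p * p <= V; ++p)
        if (V % p == 0) {
            primes.push_back(p);
            while (V % p == 0) V /= p;
        }
    if (V > 1) primes.push_back(V);

    // every multiple of those primes inside [L, R]
    std::unordered_set<long long> multiples;
    for (long long p : primes)
        for (long long x = (L + p - 1) / p * p; x <= R; x += p)
            multiples.insert(x);

    long long solution = -1;
    for (long long no : a)
        if (multiples.count(no))
            solution = std::max(solution, no);
    return solution;
}

// solve(6, 10, 22, {7, 11, 12, 17, 21}) returns 21, as in the example.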
Factor each of A's elements and store, for each possible prime factor, a sorted list of the numbers that contain this factor.
Given that a number n has O(log n) prime factors, these lists will use O(N log N) memory in total.
Then, for each query (V, L, R), search, for each prime factor of V, for the maximum number containing that factor within [L, R] (this can be done with a simple binary search).
I was asked a question for a job interview and I did not know the correct answer....
The question was:
If you have an array of 10 000 000 ints between 1 and 100, determine (efficiently) how many pairs of these ints sum up to 150 or less.
I don't know how to do this without a loop within a loop, but that is not very efficient.
Does anyone please have some pointers for me?
One way is by creating a small counting array of 101 elements. Loop through the 10,000,000 elements and count how many there are of each value, storing the counts in that array.
// create an array counter of 101 elements and set every element to 0
for (int i = 0; i < 10000000; i++) {
counter[input[i]]++;
}
Then do a second loop j from 1 to 100. Inside that, have a loop k from 1 to min(150-j, j). If k != j, add 2*counter[j]*counter[k]; if k == j, add counter[j]*(counter[j]-1).
The total sum is your result.
Your total run time is bounded above by 10,000,000 + 100*100 = 10,010,000 (it's actually smaller than this).
This is a lot faster than (10,000,000)^2, which is 100,000,000,000,000.
Of course, you have to give up space for 101 ints in memory.
Delete counter when you're done.
Note also (as pointed out in the discussion below) that this is assuming that order matters. If order doesn't matter, just divide the result by 2.
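A rough C++ version of that counting approach (counting ordered pairs of distinct positions, per the note above):

#include <algorithm>
#include <vector>

// Ordered pairs of distinct positions whose values sum to 150 or less.
long long count_pairs(const std::vector<int>& input) {            // values are in 1..100
    long long counter[101] = {0};                                  // counter[v] = occurrences of value v
    for (int v : input) counter[v]++;

    long long total = 0;
    for (int j = 1; j <= 100; ++j) {
        int limit = std::min(150 - j, j);
        for (int k = 1; k <= limit; ++k) {
            if (k != j) total += 2 * counter[j] * counter[k];      // values j and k, either order
            else        total += counter[j] * (counter[j] - 1);    // two different positions, same value
        }
    }
    return total;                                                  // divide by 2 if order does not matter
}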
First, I would sort the array. Then make a single pass through the sorted array: for the value n in the current cell, find the largest value that can still be paired with it (e.g. for 15 it is 135), locate that value's position in the sorted array with a binary search, and that position tells you how many partners n has. Sum these up and (if my mind is working correctly) you have counted each pair twice, so divide the sum by 2 to get the correct number.
This solution is O(n log n), compared to the trivial one, which is O(n^2).
These kinds of questions always require a mixture of mathematical insight and efficient programming. They don't want brute force.
First Insight
Numbers can be grouped according to how they will pair with other groups.
Putting them into:
1 - 50 | 51 - 75 | 76 - 100
   A   |    B    |     C
Group A can pair with anything.
Group B can pair with A and B, and possibly C
Group C can pair with A and possibly B, but not C
The possibly is where we need some more insight.
Second Insight
For each number in B we need to check how many numbers there are up to its complement with 150. For example, with 62 from group B we want to know from group C how many numbers are less than or equal to 88.
For each number in C we add up the tallies up to it, e.g. tallies for 76, 77, 78, ..., 88. This is known mathematically as the partial sum.
In the standard library there is a function which produces a partial_sum
vector<int> tallies(25); // this is room for the tallies from C
vector<int> partial_sums(25);
partial_sum(tallies.begin(), tallies.end(), partial_sums.begin());
Symmetry means this sum only needs to be done for one group.
Third (much later) insight
Calculating the totals for group A and B can be done using partial_sum, too. So rather than only calculating for group C, and having to track the totals some other way, just store the totals for each number from 1 to 100, and then create the partial_sum over the whole thing. partial_sums[50] will give you the amount of numbers less than or equal to 50, partial_sums[75] those less than or equal to 75, and partial_sums[100] should be 10 million, i.e. all the numbers less than or equal to 100.
Finally we can calculate the combinations from B and C. We want to add together all the multiples of totals for 50 and 100, 51 and 99, 52 and 98, etc. we can do this by iterating through the tallies from 50 to 75 and the partial_sums from 100 to 75. There is a standard library function inner_product which can handle this.
This seems quite linear to me.
#include <numeric>    // partial_sum, inner_product
#include <random>
#include <vector>
using namespace std;

int main() {
    random_device rd;
    mt19937 gen(rd());
    uniform_int_distribution<> dis(1, 100);

    vector<int> tallies(101);        // index = value, slot 0 unused
    for(int i=0; i < 10000000; ++i) {
        tallies[dis(gen)]++;
    }

    vector<int> partial_sums(101);   // partial_sums[v] = how many values are <= v
    partial_sum(tallies.begin(), tallies.end(), partial_sums.begin());

    int A = partial_sums[50];
    int AB = partial_sums[75];
    int ABC = partial_sums[100];
    int B = AB - A;
    int C = ABC - AB;

    long long A_match = (long long)A * ABC;   // 64-bit: these products overflow int
    long long B_match = (long long)B * B;
    long long C_match = inner_product(tallies.begin() + 50, tallies.begin() + 76,
                                      partial_sums.rbegin(), 0LL);
}
What is a fast way to merge sorted subsets of an array of up to 4096 32-bit floating point numbers on a modern (SSE2+) x86 processor?
Please assume the following:
The size of the entire set is at maximum 4096 items
The size of the subsets is open to discussion, but let us assume between 16-256 initially
All data used through the merge should preferably fit into L1
The L1 data cache size is 32K. 16K has already been used for the data itself, so you have 16K to play with
All data is already in L1 (with as high degree of confidence as possible) - it has just been operated on by a sort
All data is 16-byte aligned
We want to try to minimize branching (for obvious reasons)
Main criterion of feasibility: faster than an in-L1 LSD radix sort.
I'd be very interested to see if someone knows of a reasonable way to do this given the above parameters! :)
Here's a very naive way to do it. (Please excuse any 4am delirium-induced pseudo-code bugs ;)
//4x sorted subsets
data[4][4] = {
    {3, 4, 5, INF},
    {2, 7, 8, INF},
    {1, 4, 4, INF},
    {5, 8, 9, INF}
}
data_offset[4] = {0, 0, 0, 0}
n = 4*3

for(i=0, i<n, i++):
    sub = 0
    sub = 1 * (data[sub][data_offset[sub]] > data[1][data_offset[1]])
    sub = 2 * (data[sub][data_offset[sub]] > data[2][data_offset[2]])
    sub = 3 * (data[sub][data_offset[sub]] > data[3][data_offset[3]])
    out[i] = data[sub][data_offset[sub]]
    data_offset[sub]++
Edit:
With AVX2 and its gather support, we could compare up to 8 subsets at once.
Edit 2:
Depending on type casting, it might be possible to shave off 3 extra clock cycles per iteration on a Nehalem (mul: 5, shift+sub: 4)
//Assuming 'sub' is uint32_t
sub = ... << ((data[sub][data_offset[sub]] > data[...][data_offset[...]]) - 1)
Edit 3:
It may be possible to exploit out-of-order execution to some degree, especially as K gets larger, by using two or more max values:
max1 = 0
max2 = 1
max1 = 2 * (data[max1][data_offset[max1]] > data[2][data_offset[2]])
max2 = 3 * (data[max2][data_offset[max2]] > data[3][data_offset[3]])
...
max1 = 6 * (data[max1][data_offset[max1]] > data[6][data_offset[6]])
max2 = 7 * (data[max2][data_offset[max2]] > data[7][data_offset[7]])
q = data[max1][data_offset[max1]] < data[max2][data_offset[max2]]
sub = max1*q + ((~max2)&1)*q
Edit 4:
Depending on compiler intelligence, we can remove multiplications altogether using the ternary operator:
sub = (data[sub][data_offset[sub]] > data[x][data_offset[x]]) ? x : sub
Edit 5:
In order to avoid costly floating point comparisons, we could simply reinterpret_cast<uint32_t*>() the data, as this would result in an integer compare.
Another possibility is to utilize SSE registers as these are not typed, and explicitly use integer comparison instructions.
This works because, for non-negative floats, the operators < > == yield the same results whether the values are compared as floats or their bit patterns are compared as integers.
Edit 6:
If we unroll our loop sufficiently to match the number of values to the number of SSE registers, we could stage the data that is being compared.
At the end of an iteration we would then re-transfer the register which contained the selected maximum/minimum value, and shift it.
Although this requires reworking the indexing slightly, it may prove more efficient than littering the loop with LEA's.
This is more of a research topic, but I did find this paper which discusses minimizing branch mispredictions using d-way merge sort.
SIMD sorting algorithms have already been studied in detail. The paper Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture describes an efficient algorithm for doing what you describe (and much more).
The core idea is that you can reduce merging two arbitrarily long lists to merging blocks of k consecutive values (where k can range from 4 to 16): the first block is z[0] = merge(x[0], y[0]).lo. To obtain the second block, we know that the leftover merge(x[0], y[0]).hi contains nx elements from x and ny elements from y, with nx+ny == k. But z[1] cannot contain elements from both x[1] and y[1], because that would require z[1] to contain more than nx+ny elements: so we just have to find out which of x[1] and y[1] needs to be added. The one with the lower first element will necessarily appear first in z, so this is simply done by comparing their first element. And we just repeat that until there is no more data to merge.
Pseudo-code, assuming the arrays end with a +inf value:
a := *x++
b := *y++
while not finished:
    lo,hi := merge(a,b)
    *z++ := lo
    a := hi
    if *x[0] <= *y[0]:
        b := *x++
    else:
        b := *y++
(note how similar this is to the usual scalar implementation of merging)
The conditional jump is of course not necessary in an actual implementation: for example, you could conditionally swap x and y with an xor trick, and then read unconditionally *x++.
merge itself can be implemented with a bitonic sort. But if k is low, there will be a lot of inter-instruction dependencies resulting in high latency. Depending on the number of arrays you have to merge, you can then choose k high enough so that the latency of merge is masked, or if this is possible interleave several two-way merges. See the paper for more details.
Edit: Below is a diagram when k = 4. All asymptotics assume that k is fixed.
The big gray box is merging two arrays of size n = m * k (in the picture, m = 3).
We operate on blocks of size k.
The "whole-block merge" box merges the two arrays block-by-block by comparing their first elements. This is a linear time operation, and it doesn't consume memory because we stream the data to the rest of the block. The performance doesn't really matter because the latency is going to be limited by the latency of the "merge4" blocks.
Each "merge4" box merges two blocks, outputs the lower k elements, and feeds the upper k elements to the next "merge4". Each "merge4" box performs a bounded number of operations, and the number of "merge4" is linear in n.
So the time cost of merging is linear in n. And because "merge4" has a lower latency than performing 8 serial non-SIMD comparisons, there will be a large speedup compared to non-SIMD merging.
Finally, to extend our 2-way merge to merge many arrays, we arrange the big gray boxes in classical divide-and-conquer fashion. Each level has complexity linear in the number of elements, so the total complexity is O(n log (n / n0)) with n0 the initial size of the sorted arrays and n is the size of the final array.
The most obvious answer that comes to mind is a standard N-way merge using a heap. That'll be O(N log k). The number of subsets is between 16 and 256, so the worst case behavior (with 256 subsets of 16 items each) would be 8N.
Cache behavior should be ... reasonable, although not perfect. The heap, where most of the action is, will probably remain in the cache throughout. The part of the output array being written to will also most likely be in the cache.
What you have is 16K of data (the array with sorted subsequences), the heap (1K, worst case), and the sorted output array (16K again), and you want it to fit into a 32K cache. Sounds like a problem, but perhaps it isn't. The data that will most likely be swapped out is the front of the output array after the insertion point has moved. Assuming that the sorted subsequences are fairly uniformly distributed, they should be accessed often enough to keep them in the cache.
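For reference, here is what that baseline looks like in plain C++ (a scalar sketch using std::priority_queue; whether it beats an in-L1 radix sort is exactly the open question):

#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Baseline k-way merge with a min-heap: O(N log k), no SIMD.
std::vector<float> kway_merge(const std::vector<std::vector<float>>& runs) {
    // heap entries: (value, (run index, position within run))
    typedef std::pair<float, std::pair<int, int> > Item;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item> > heap;
    std::size_t total = 0;
    for (int r = 0; r < (int)runs.size(); ++r) {
        total += runs[r].size();
        if (!runs[r].empty()) heap.push(Item(runs[r][0], std::make_pair(r, 0)));
    }
    std::vector<float> out;
    out.reserve(total);
    while (!heap.empty()) {
        Item top = heap.top();
        heap.pop();
        out.push_back(top.first);                 // emit the current minimum
        int r = top.second.first;
        int i = top.second.second + 1;            // advance within that run
        if (i < (int)runs[r].size()) heap.push(Item(runs[r][i], std::make_pair(r, i)));
    }
    return out;
}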
You can merge int arrays (expensive) branch free.
typedef unsigned uint;
typedef uint* uint_ptr;

void merge(uint *in1_begin, uint *in1_end, uint *in2_begin, uint *in2_end, uint *out){
    uint_ptr in [] = {in1_begin, in2_begin};
    uint_ptr in_end [] = {in1_end, in2_end};

    // the loop branch is cheap because it is easily predictable
    while(in[0] != in_end[0] && in[1] != in_end[1]){
        int i = (*in[1] - *in[0]) >> 31;   // 1 exactly when the second head is smaller
        *out = *in[i];
        ++out;
        ++in[i];
    }

    // copy the remaining stuff ...
}
Note that (*in[1] - *in[0]) >> 31 picks out the sign bit of the difference, so (as long as the two values differ by less than 2^31) it is equivalent to *in[1] < *in[0], i.e. "the second head is smaller". The reason I wrote it down using the bitshift trick instead of
int i = *in[1] < *in[0];
is that not all compilers generate branch free code for the < version.
Unfortunately you are using floats instead of ints, which at first seems like a showstopper because I do not see how to reliably implement *in[1] < *in[0] branch free. However, on most modern architectures you can interpret the bit patterns of positive floats (that are not NaNs, INFs or other such strange things) as ints, compare them using <, and still get the correct result. Perhaps you can extend this observation to arbitrary floats.
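A small sketch of that observation (only valid for non-negative, non-NaN values):

#include <cstdint>
#include <cstring>

// For non-negative, non-NaN IEEE-754 floats the raw bit patterns, compared as
// unsigned integers, are ordered the same way as the float values themselves.
inline std::uint32_t float_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);   // well-defined way to reinterpret the bits
    return u;
}

inline bool less_than(float a, float b) {   // valid only for non-negative inputs
    return float_bits(a) < float_bits(b);
}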
You could do a simple merge kernel to merge K lists:
float *input[K];
float *output;

while (true) {
    float min = *input[0];
    int min_idx = 0;
    for (int i = 1; i < K; i++) {
        float v = *input[i];
        if (v < min) {
            min = v;       // do with cmov
            min_idx = i;   // do with cmov
        }
    }
    if (min == SENTINEL) break;
    *output++ = min;
    input[min_idx]++;
}
There's no heap, so it is pretty simple. The bad part is that it is O(NK), which can be bad if K is large (unlike the heap implementation which is O(N log K)). So then you just pick a maximum K (4 or 8 might be good, then you can unroll the inner loop), and do larger K by cascading merges (handle K=64 by doing 8-way merges of groups of lists, then an 8-way merge of the results).