Trying to understand this solution - C++

I was trying to solve a question and ran into a few obstacles I could not get past. Here is the question: Codeforces - 817D
I tried to brute force it: for every subsegment of the array I found the min and max, subtracted them, and summed the differences to get the final imbalance. This looked good, but it gave me a time limit exceeded, since brute forcing visits n*(n+1)/2 subsegments and n can be 10^6. I failed to get around that, and after a couple of hours of not getting any new ideas I decided to look at a solution, which, to be honest, I could not understand at all. Here is the solution:
#include <bits/stdc++.h>
using namespace std;
#define ll long long

const int INF = INT_MAX; // sentinel, larger than any |a[i]|
int a[1000000], l[1000000], r[1000000];

int main(void) {
    int i, j, n;
    scanf("%d", &n);
    for (i = 0; i < n; i++) scanf("%d", &a[i]);
    ll ans = 0;
    for (j = 0; j < 2; j++) { // pass 0: adds the sum of maxima; pass 1: subtracts the sum of minima (via negation)
        vector<pair<int,int>> v; // used as a stack of (index, value) with decreasing values
        v.push_back({-1, INF});
        for (i = 0; i < n; i++) {
            while (v.back().second <= a[i]) v.pop_back();
            l[i] = v.back().first; // index of previous element strictly greater than a[i]
            v.push_back({i, a[i]});
        }
        v.clear();
        v.push_back({n, INF});
        for (i = n - 1; i >= 0; i--) {
            while (v.back().second < a[i]) v.pop_back();
            r[i] = v.back().first; // index of next element greater than or equal to a[i]
            v.push_back({i, a[i]});
        }
        // a[i] is the maximum of exactly (i - l[i]) * (r[i] - i) subsegments
        for (i = 0; i < n; i++) ans += (ll) a[i] * (i - l[i]) * (r[i] - i);
        // negate everything so the same code sums the minima on the second pass
        for (i = 0; i < n; i++) a[i] *= -1;
    }
    cout << ans;
}
I tried tracing it, but I keep wondering why the vector was used. The only idea I got is that the author wanted to use the vector as a stack, since they behave almost the same, but then I don't even know why we needed a stack here, and this equation ans += (ll) a[i] * (i-l[i]) * (r[i]-i); is really confusing me because I don't get where it came from.

Well, that's a beast of a calculation. I must confess that I don't understand it completely either. The problem with the brute force solution is that you have to calculate the same values over and over again.
In a slightly modified example, you calculate the following values for an input of 2, 4, 1 (I reordered it by "distance"):
[2, *, *] (from index 0 to index 0), imbalance value is 0; i_min = 0, i_max = 0
[*, 4, *] (from index 1 to index 1), imbalance value is 0; i_min = 1, i_max = 1
[*, *, 1] (from index 2 to index 2), imbalance value is 0; i_min = 2, i_max = 2
[2, 4, *] (from index 0 to index 1), imbalance value is 2; i_min = 0, i_max = 1
[*, 4, 1] (from index 1 to index 2), imbalance value is 3; i_min = 2, i_max = 1
[2, 4, 1] (from index 0 to index 2), imbalance value is 3; i_min = 2, i_max = 1
where i_min and i_max are the indices of the element with the minimum and maximum value.
For a better visual understanding, I wrote out the complete array but hid the unused values with *.
So in the last case [2, 4, 1], brute force looks for the minimum over all values again, which is not necessary, because you already calculated the values for a sub-space of the problem when you computed [2, 4] and [4, 1]. But comparing only the values is not enough; you also need to keep track of the indices of the minimum and maximum element, because those can be reused in the next step, when calculating [2, 4, 1].
The idea behind this is a concept called dynamic programming, where results from a calculation are stored to be used again. As so often, you have to choose between speed and memory consumption.
So to come back to your question, here is what I understood:
the arrays l and r store, for each position i, the index of the nearest greater element to the left (l[i]) and to the right (r[i]) of the current one (on the right it is "greater or equal", so ties are counted exactly once)
vector v is used as a stack to find the last number (and its index) that is greater than the current one (a[i]). Its values decrease from bottom to top; e.g. for the input 5, 3, 4, first the 5 is stored, then the 3, and when the 4 comes the 3 is popped, but the index of the 5 is what is needed (to be stored in l[2])
then there is this fancy calculation (ans += (ll) a[i] * (i-l[i]) * (r[i]-i)). The stored indices of the maximum (and in the second run the minimum) elements are combined with the value a[i]; I couldn't fully derive it at first, but it works, and the worked example after this list shows why
at last, all values in the array a are multiplied by -1, which means the old maximums are now the minimums, and the calculation is done again (2nd run of the outer for-loop over j)
This last step (multiplying a by -1) and the outer for-loop over j are not strictly necessary, but they are an elegant way to reuse the code.
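Here is one way to read that formula (the standard counting argument, worked through by hand): a[i] * (i - l[i]) * (r[i] - i) adds a[i] once for every subsegment in which a[i] is the maximum. Such a subsegment must start at one of the (i - l[i]) positions in (l[i], i] and end at one of the (r[i] - i) positions in [i, r[i]), because extending past l[i] or r[i] would pull in a larger element. For the input 2, 4, 1: a[0] = 2 has l[0] = -1 and r[0] = 1, contributing 2 * 1 * 1 = 2; a[1] = 4 has l[1] = -1 and r[1] = 3, contributing 4 * 2 * 2 = 16; a[2] = 1 has l[2] = 1 and r[2] = 3, contributing 1 * 1 * 1 = 1. The total, 19, is exactly the sum of the maxima of all six subsegments (2 + 4 + 1 + 4 + 4 + 4). The second pass subtracts the sum of the minima the same way (11 here), and 19 - 11 = 8 matches the total imbalance 0 + 0 + 0 + 2 + 3 + 3 from the table above.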
Hope this helps a bit.

Related

Intuition behind using a monotonic stack

I am solving a question on LeetCode.com:
Given an array of integers A, find the sum of min(B), where B ranges over every (contiguous) subarray of A. Since the answer may be large, return the answer modulo 10^9 + 7.
Input: [3,1,2,4]
Output: 17
Explanation: Subarrays are [3], [1], [2], [4], [3,1], [1,2], [2,4], [3,1,2], [1,2,4], [3,1,2,4]. Minimums are 3, 1, 2, 4, 1, 1, 2, 1, 1, 1. Sum is 17.
A highly upvoted solution is as below:
class Solution {
public:
    int sumSubarrayMins(vector<int>& A) {
        stack<pair<int, int>> in_stk_p, in_stk_n;
        // left is for the distance to previous less element
        // right is for the distance to next less element
        vector<int> left(A.size()), right(A.size());
        // initialize
        for (int i = 0; i < A.size(); i++) left[i] = i + 1;
        for (int i = 0; i < A.size(); i++) right[i] = A.size() - i;
        for (int i = 0; i < A.size(); i++) {
            // for previous less
            while (!in_stk_p.empty() && in_stk_p.top().first > A[i]) in_stk_p.pop();
            left[i] = in_stk_p.empty() ? i + 1 : i - in_stk_p.top().second;
            in_stk_p.push({A[i], i});
            // for next less
            while (!in_stk_n.empty() && in_stk_n.top().first > A[i]) {
                auto x = in_stk_n.top(); in_stk_n.pop();
                right[x.second] = i - x.second;
            }
            in_stk_n.push({A[i], i});
        }
        int mod = 1e9 + 7;
        long long ans = 0; // long long: the product below can overflow int
        for (int i = 0; i < A.size(); i++) {
            ans = (ans + (long long)A[i] * left[i] * right[i]) % mod;
        }
        return (int)ans;
    }
};
My question is: what is the intuition behind using a monotonically increasing stack for this? How does it help calculate the minimums in the various subarrays?
Visualize the array as a line graph, with (local) minima as valleys. Each value is relevant for a range that extends from just after the previous smaller value (if any) to just before the next smaller value (if any). (Even a value larger than its neighbors matters when considering the singleton subarray that contains it.) The variables left and right track that range.
Recognizing that a value shadows every value larger than it in each direction separately, the stack maintains a list of previous, unshadowed minima for two purposes: identifying how far back a new small number’s range extends and (at the same time) how far forward the invalidated minima’s ranges extend. The code uses a separate stack for each purpose, but there’s no need: each has the same contents after every iteration of the (outer) loop.
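To make that concrete, here is a minimal one-stack sketch (my rewrite, not the quoted solution; the function name is mine). Each index is pushed and popped exactly once, and at the moment an index is popped both ends of its range are known, so its contribution can be added immediately:
#include <bits/stdc++.h>
using namespace std;

int sumSubarrayMinsOneStack(vector<int>& A) {
    const long long MOD = 1e9 + 7;
    int n = A.size();
    long long ans = 0;
    stack<int> st; // indices whose values are strictly increasing bottom-to-top
    for (int i = 0; i <= n; i++) {
        // i == n acts as a sentinel that flushes the stack at the end
        while (!st.empty() && (i == n || A[st.top()] >= A[i])) {
            int mid = st.top(); st.pop();
            int left = st.empty() ? -1 : st.top(); // previous strictly smaller element
            // A[mid] is the minimum of every subarray that starts in (left, mid]
            // and ends in [mid, i), i.e. (mid - left) * (i - mid) subarrays
            ans = (ans + (long long)A[mid] * (mid - left) % MOD * (i - mid)) % MOD;
        }
        if (i < n) st.push(i);
    }
    return (int)ans;
}
On the example [3,1,2,4] this pops the 3 at i = 1 (contribution 3), then flushes 4, 2 and 1 at the sentinel step (contributions 4, 4 and 6), for a total of 17, matching the expected output.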

Fastest way to find smallest missing integer from list of integers

I have a list of 100 random integers. Each random integer has a value from 0 to 99. Duplicates are allowed, so the list could be something like
56, 1, 1, 1, 1, 0, 2, 6, 99...
I need to find the smallest integer (>= 0) that is not contained in the list.
My initial solution is this:
vector<int> integerList(100); //list of random integers
...
vector<bool> listedIntegers(101, false);
for (int theInt : integerList)
{
listedIntegers[theInt] = true;
}
int smallestInt;
for (int j = 0; j < 101; j++)
{
if (!listedIntegers[j])
{
smallestInt = j;
break;
}
}
But that requires a secondary array for book-keeping and a second (potentially full) list iteration. I need to perform this task millions of times (the actual application is in a greedy graph coloring algorithm, where I need to find the smallest unused color value with a vertex adjacency list), so I'm wondering if there's a clever way to get the same result without so much overhead?
It's been a year, but ...
One idea that comes to mind is to keep track of the interval(s) of unused values as you iterate the list. To allow efficient lookup, you could keep intervals as tuples in a binary search tree, for example.
So, using your sample data:
56, 1, 1, 1, 1, 0, 2, 6, 99...
You would initially have the unused interval [0..99], and then, as each input value is processed:
56: [0..55][57..99]
1: [0..0][2..55][57..99]
1: no change
1: no change
1: no change
0: [2..55][57..99]
2: [3..55][57..99]
6: [3..5][7..55][57..99]
99: [3..5][7..55][57..98]
Result (lowest value in lowest remaining interval): 3
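Here is a minimal sketch of that idea using std::map as the search tree (the function name and exact interface are mine, just for illustration):
#include <map>
#include <vector>

int smallestMissing(const std::vector<int>& values, int maxValue) {
    std::map<int, int> unused; // start -> end (inclusive), disjoint intervals
    unused[0] = maxValue;
    for (int v : values) {
        auto it = unused.upper_bound(v); // first interval starting after v
        if (it == unused.begin()) continue; // v lies below every interval
        --it; // candidate interval [lo, hi] with lo <= v
        int lo = it->first, hi = it->second;
        if (v > hi) continue; // v was already removed earlier
        unused.erase(it);
        if (lo < v) unused[lo] = v - 1; // remainder left of v
        if (v < hi) unused[v + 1] = hi; // remainder right of v
    }
    return unused.empty() ? maxValue + 1 : unused.begin()->first;
}
Each update is O(log k) in the number of intervals, although for only 100 values in [0, 99] the plain vector<bool> scan is hard to beat in practice.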
I believe there is no fundamentally faster way to do it. What you can do in your case is reuse the vector<bool>: you need just one such vector per thread.
Though the better approach might be to reconsider the whole algorithm to eliminate this step entirely. Maybe you can update the least unused color on every step of the algorithm?
Since you have to scan the whole list no matter what, the algorithm you have is already pretty good. The only improvement I can suggest without measuring (that will surely speed things up) is to get rid of your vector<bool>, and replace it with a stack-allocated array of 4 32-bit integers or 2 64-bit integers.
Then you won't have to pay the cost of allocating an array on the heap every time, and you can get the first unused number (the position of the first 0 bit) much faster. To find the word that contains the first 0 bit, you only need to find the first one that isn't the maximum value, and there are bit twiddling hacks you can use to get the first 0 bit in that word very quickly.
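For example, a minimal sketch with two 64-bit words, assuming values in [0, 99] and a GCC/Clang-style compiler for __builtin_ctzll (the function name is illustrative):
#include <cstdint>
#include <vector>

int smallestMissing(const std::vector<int>& values) {
    uint64_t used[2] = {0, 0}; // 128 bits on the stack, no heap allocation
    for (int v : values)
        used[v >> 6] |= uint64_t{1} << (v & 63); // mark value v as seen
    for (int w = 0; w < 2; w++)
        if (~used[w]) // this word still has a 0 bit somewhere
            return w * 64 + __builtin_ctzll(~used[w]); // index of the lowest 0 bit
    return 128; // unreachable for 100 values in [0, 99]
}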
Your program is already very efficient, at O(n); only marginal gains can be found.
One possibility is to divide the range of possible values into blocks of size block, and to register the values not in an array of bool but in an array of int, memorizing each value modulo block as a bit.
In practice, this replaces a loop of size N by a loop of size N/block plus a loop of size block.
Theoretically, we could select block = sqrt(N) = 12 in order to minimize the quantity N/block + block.
In the program hereafter, blocks of size 8 are selected, assuming that dividing integers by 8 and calculating values modulo 8 should be fast.
However, it is clear that a gain, if any, can be obtained only when the smallest missing value is rather large!
#include <vector>

constexpr int N = 100;

int find_min1(const std::vector<int> &IntegerList) {
    constexpr int block = 8;
    constexpr int Size = 13;  // ceil(N / block)
    constexpr int Vmax = 255; // 2^block - 1, i.e. all bits set
    int listedBlocks[Size] = {0};
    for (int theInt : IntegerList) {
        listedBlocks[theInt / block] |= 1 << (theInt % block);
    }
    for (int j = 0; j < Size; j++) {
        if (listedBlocks[j] == Vmax) continue; // block completely full, skip it
        int &k = listedBlocks[j];
        for (int b = 0; b < block; b++) {
            if ((k % 2) == 0) return block * j + b; // first unset bit found
            k /= 2;
        }
    }
    return -1;
}
Potentially you can reduce the last step to O(1) by using some bit manipulation, in your case __int128: set the corresponding bits in loop one, then find the lowest zero bit, e.g. by complementing the value and calling a count-trailing-zeros builtin such as __builtin_ctzll, or by using the appropriate bit hack.
The best solution I could find for finding the smallest missing integer from a set is https://codereview.stackexchange.com/a/179042/31480
Here is a C++ version.
// Finds the smallest positive integer (>= 1) not present in A.
int solution(std::vector<int>& A)
{
    // Cyclic sort: try to place each value v at index v - 1.
    for (std::vector<int>::size_type i = 0; i != A.size(); i++)
    {
        while (0 < A[i] && A[i] - 1 < (int)A.size()
               && A[i] != (int)i + 1
               && A[i] != A[A[i] - 1])
        {
            std::swap(A[i], A[A[i] - 1]);
        }
    }
    // The first index whose value is out of place gives the answer.
    for (std::vector<int>::size_type i = 0; i != A.size(); i++)
    {
        if (A[i] != (int)i + 1)
        {
            return i + 1;
        }
    }
    return A.size() + 1;
}

Implementing iterative autocorrelation process in C++ using for loops

I am implementing pitch tracking using an autocorrelation method in C++, but I am struggling to write the actual line of code which performs the autocorrelation.
I have an array containing a certain number ('values') of amplitude values of a pre-recorded signal, and I am performing the autocorrelation function on a set number (N) of these values.
In order to perform the autocorrelation I have taken the original array and reversed it, so that point 0 = point N, point 1 = point N-1, etc. This array is called revarray.
Here is what I want to do mathematically:
(array[0] * revarray[0])
(array[0] * revarray[1]) + (array[1] * revarray[0])
(array[0] * revarray[2]) + (array[1] * revarray[1]) + (array[2] * revarray[0])
(array[0] * revarray[3]) + (array[1] * revarray[2]) + (array[2] * revarray[1]) + (array[3] * revarray[0])
...and so on. This will be repeated for array[900]->array[1799] etc until autocorrelation has been performed on all of the samples in the array.
The number of times the autocorrelation is carried out is:
values / N = measurements
Here is the relevant section of my code so far:
for (k = 0; k = measurements; ++k){
for (i = k*(N - 1), j = k*N; i >= 0; i--, j++){
revarray[j] = array[i];
for (a = k*N; a = k*(N - 1); ++a){
autocor[a]=0;
for (b = k*N; b = k*(N - 1); ++b){
autocor[a] += //**Here is where I'm confused**//
}
}
}
}
I know that I want to keep iteratively adding new values to autocor[a], but my problem is that the value that needs to be added to will keep changing. I've tried using an increasing count like so:
for (i = (k*N); i = k*(N-1); ++i){
autocor[i] += array[i] * revarray[i-1]
}
But I know this won't work, because when the new value is added to the previous autocor[i], that previous value will be incorrect, and when i = 0 it will be impossible to calculate using revarray[i-1].
Any suggestions? I've been struggling with this for a while now. I managed to get it working on just a single array (not taking N samples at a time) as seen here, but I think using the inverted array is a much more efficient approach; I'm just struggling to implement the autocorrelation by taking sections of the entire signal.
It is not very clear to me, but I'll assume that you need to perform as many iterations as there are elements in that array (if it is indeed only half as many, adjust the code accordingly).
Also, N is assumed to mean the size of the array, so the index of the last element is N-1.
The loops would look like this:
for (size_t i = 0; i < N; ++i) {
    autocorr[i] = 0;
    for (size_t j = 0; j <= i; ++j) {
        const size_t idxA = j,
                     idxR = i - j; // direct and reverse indices in the array
        autocorr[i] += array[idxA] * array[idxR];
    }
}
Basically you run the outer loop as many times as there are elements in your array, and for each of those iterations you run a shorter inner loop up to the current index of the outer loop.
All that is left to be done now is to properly calculate the indices of array and revarray to perform the calculations, and to accumulate a running sum at the current outer loop's index.
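If you want to keep your original block-wise setup (one autocorrelation per window of N samples, values / N = measurements windows), here is a hedged sketch of how the loops above could be extended; the function name and the use of double are my own assumptions:
#include <cstddef>
#include <vector>

std::vector<std::vector<double>>
blockAutocorrelation(const std::vector<double>& signal, std::size_t N) {
    std::size_t measurements = signal.size() / N; // number of complete windows
    std::vector<std::vector<double>> autocor(measurements,
                                             std::vector<double>(N, 0.0));
    for (std::size_t k = 0; k < measurements; ++k) {
        const double* block = &signal[k * N]; // samples k*N .. k*N + N - 1
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j <= i; ++j)
                autocor[k][i] += block[j] * block[i - j]; // same scheme as above
    }
    return autocor;
}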

Optimizing this code block

for (int i = 0; i < 5000; i++)
    for (int j = 0; j < 5000; j++)
    {
        for (int ii = 0; ii < 20; ii++)
            for (int jj = 0; jj < 20; jj++)
            {
                int num = matBigger[i+ii][j+jj];
                // Extract range from this.
                int low = num & 0xff;
                int high = num >> 8;
                if (low < matSmaller[ii][jj] && matSmaller[ii][jj] > high)
                {
                    // match found
                }
            }
    }
The machine is x86_64, with 32 KB L1 cache and 256 KB L2 cache.
Any pointers on how can I possibly optimize this code?
EDIT Some background to the original problem : Fastest way to Find a m x n submatrix in M X N matrix
First thing I'd try is to move the ii and jj loops outside the i and j loops. That way you're using the same elements of matSmaller for 25 million iterations of the i and j loops, meaning that you (or the compiler if you're lucky) can hoist the access to them outside those loops:
for (int ii = 0; ii < 20; ii++)
    for (int jj = 0; jj < 20; jj++)
    {
        int smaller = matSmaller[ii][jj];
        for (int i = 0; i < 5000; i++)
            for (int j = 0; j < 5000; j++) {
                int num = matBigger[i+ii][j+jj];
                int low = num & 0xff;
                if (low < smaller && smaller > (num >> 8)) {
                    // match found
                }
            }
    }
This might be faster (thanks to less access to the matSmaller array), or it might be slower (because I've changed the pattern of access to the matBigger array, and it's possible that I've made it less cache-friendly). A similar alternative would be to move the ii loop outside i and j and hoist matSmaller[ii], but leave the jj loop inside. The rule of thumb is that it's more cache-friendly to increment the last index of a multi-dimensional array in your inner loops, than earlier indexes. So we're "happier" to modify jj and j than we are to modify ii and i.
Second thing I'd try - what's the type of matBigger? Looks like the values in it are only 16 bits, so try it both as int and as (u)int16_t. The former might be faster because aligned int access is fast. The latter might be faster because more of the array fits in cache at any one time.
There are some higher-level things you could consider with some early analysis of smaller: for example, if it's 0 then you needn't examine matBigger for that value of ii and jj, because (num & 0xff) < 0 is always false.
To do better than "guess things and see whether they're faster or not" you need to know for starters which line is hottest, which means you need a profiler.
Some basic advice:
Profile it, so you can learn where the hot-spots are.
Think about cache locality, and the addresses resulting from your loop order.
Use more const in the innermost scope, to hint more to the compiler.
Try breaking it up so you don't compute high if the low test is failing.
Try maintaining the offsets into matBigger and matSmaller explicitly, reducing the innermost stepping to a simple increment.
The best thing is to understand what the code is supposed to do, then check whether another algorithm exists for this problem.
Apart from that:
if you are just interested in whether a matching entry exists, make sure to break out of all of the loops at the position of // match found.
make sure the data is stored in an optimal way. It all depends on your problem, but e.g. it could be more efficient to have just one array of size 5000*5000*20 and overload operator()(int,int,int) for accessing the elements.
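For instance, a minimal sketch of that flat-storage idea (the struct name and dimensions are illustrative, not from the original code):
#include <cstddef>
#include <vector>

struct Mat3D {
    std::vector<int> data; // one contiguous allocation instead of nested arrays
    std::size_t dim2, dim3;
    Mat3D(std::size_t d1, std::size_t d2, std::size_t d3)
        : data(d1 * d2 * d3), dim2(d2), dim3(d3) {}
    int& operator()(std::size_t i, std::size_t j, std::size_t k) {
        return data[(i * dim2 + j) * dim3 + k]; // row-major layout
    }
};
Note that a 5000*5000*20 array of int is about 2 GB, so in practice you would want a smaller element type or a different decomposition; the sketch only shows the access pattern.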
What are matSmaller and matBigger?
If they are manually indexed 2-D arrays, try flattening them, e.g. changing the access to matBigger[(i+ii) * COL_COUNT + (j+jj)].
I agree with Steve about rearranging your loops to have the higher count as the inner loop. Since your code is only doing loads and compares, I believe a significant portion of the time is used for pointer arithmetic. Try an experiment to change Steve's answer into this:
for (int ii = 0; ii < 20; ii++)
{
    for (int jj = 0; jj < 20; jj++)
    {
        int smaller = matSmaller[ii][jj];
        for (int i = 0; i < 5000; i++)
        {
            int *pI = &matBigger[i+ii][jj];
            for (int j = 0; j < 5000; j++)
            {
                int num = *pI++;
                int low = num & 0xff;
                if (low < smaller && smaller > (num >> 8)) {
                    // match found
                }
            } // for j
        } // for i
    } // for jj
} // for ii
Even in 64-bit mode, the C compiler doesn't necessarily do a great job of keeping everything in register. By changing the array access to be a simple pointer increment, you'll make the compiler's job easier to produce efficient code.
Edit: I just noticed #unwind suggested basically the same thing. Another issue to consider is the statistics of your comparison. Is the low or high comparison more probable? Arrange the conditional statement so that the less probable test is first.
Looks like there is a lot of repetition here. One optimization is to reduce the amount of duplicate effort. Using pen and paper, I'm showing the matBigger "i" index iterating as:
[0 + 0], [0 + 1], [0 + 2], ..., [0 + 19],
[1 + 0], [1 + 1], ..., [1 + 18], [1 + 19]
[2 + 0], ..., [2 + 17], [2 + 18], [2 + 19]
As you can see there are locations that are accessed many times.
Also, multiplying the iteration counts shows that the innermost code is executed 20 * 20 * 5000 * 5000, or 10,000,000,000 (1e10) times. That's a lot!
So rather than trying to speed up the execution of 1e10 iterations (through instruction (pipeline) cache or data cache optimization), try reducing the number of iterations.
The code is searching the matrix for a number that is within a range: larger than a minimal value and less than the maximum range value.
Based on this, try a different approach:
Find and remember all coordinates where the search value is greater than the low value. Let us call these anchor points.
For each anchor point, find the coordinates of the first value after the anchor point that is outside the range.
The objective is to reduce the number of duplicate accesses. Anchor points allow for a one-pass scan and allow other decisions, such as finding a range or determining an MxN matrix that contains the anchor value.
Another idea is to create new data structures containing the matBigger and matSmaller that are more optimized for searching.
For example, create a {value, coordinate list} entry for each unique value in matSmaller:
Value coordinate list
26 -> (2,3), (6,5), ..., (1007, 75)
31 -> (4,7), (2634, 5), ...
Now you can use this data structure to find values in matSmaller and immediately know their locations. So you could search matBigger for each unique value in this data structure. This again reduces the number of access to the matrices.
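A minimal sketch of such a value -> coordinate-list index (names and types are illustrative):
#include <unordered_map>
#include <utility>
#include <vector>

using Coord = std::pair<int, int>;

// Build an index from each distinct value in the matrix to every
// coordinate where it occurs.
std::unordered_map<int, std::vector<Coord>>
buildValueIndex(const std::vector<std::vector<int>>& mat) {
    std::unordered_map<int, std::vector<Coord>> index;
    for (int r = 0; r < (int)mat.size(); r++)
        for (int c = 0; c < (int)mat[r].size(); c++)
            index[mat[r][c]].push_back({r, c});
    return index;
}
Looking up a value then immediately yields all of its locations in matSmaller, as described above.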

Porting optimized Sieve of Eratosthenes from Python to C++

Some time ago I used the (blazing fast) primesieve in python that I found here: Fastest way to list all primes below N
To be precise, this implementation:
def primes2(n):
    """ Input n>=6, Returns a list of primes, 2 <= p < n """
    n, correction = n-n%6+6, 2-(n%6>1)
    sieve = [True] * (n/3)
    for i in xrange(1,int(n**0.5)/3+1):
        if sieve[i]:
            k=3*i+1|1
            sieve[ k*k/3 ::2*k] = [False] * ((n/6-k*k/6-1)/k+1)
            sieve[k*(k-2*(i&1)+4)/3::2*k] = [False] * ((n/6-k*(k-2*(i&1)+4)/6-1)/k+1)
    return [2,3] + [3*i+1|1 for i in xrange(1,n/3-correction) if sieve[i]]
Now I can slightly grasp the idea of the optimization, automatically skipping multiples of 2, 3 and so on, but when it comes to porting this algorithm to C++ I get stuck (I have a good understanding of Python and a reasonable/bad understanding of C++, but good enough for rock 'n roll).
What I currently have rolled myself is this (isqrt() is just a simple integer square root function):
template <class T>
void primesbelow(T N, std::vector<T> &primes) {
    T sievemax = (N-3 + (1-(N % 2))) / 2; // one bit per odd number >= 3
    T i;
    T sievemaxroot = isqrt(sievemax) + 1;
    boost::dynamic_bitset<> sieve(sievemax); // valid indices are 0 .. sievemax-1
    sieve.set();
    primes.push_back(2);
    for (i = 0; i <= sievemaxroot; i++) {
        if (sieve[i]) {
            primes.push_back(2*i+3);
            for (T j = 3*i+3; j < sievemax; j += 2*i+3) sieve[j] = 0; // filter multiples
        }
    }
    for (; i < sievemax; i++) {
        if (sieve[i]) primes.push_back(2*i+3);
    }
}
This implementation is decent and automatically skips multiples of 2, but if I could port the Python implementation I think it could be much faster (50%-30% or so).
To compare the results (in the hope this question will be successfully answered), the current execution time with N=100000000, g++ -O3 on a Q6600 Ubuntu 10.10 is 1230ms.
Now I would love some help with either understanding what the above Python implementation does or that you would port it for me (not as helpful though).
EDIT
Some extra information about what I find difficult.
I have trouble with the techniques used, like the correction variable, and in general with how it all comes together. A link to a site explaining different Eratosthenes optimizations (apart from the simple sites that say "well, you just skip multiples of 2, 3 and 5" and then slam you with a 1000-line C file) would be awesome.
I don't think I would have issues with a 100% direct and literal port, but since after all this is for learning that would be utterly useless.
EDIT
After looking at the code in the original numpy version, it actually is pretty easy to implement, and with some thinking not too hard to understand. This is the C++ version I came up with. I'm posting it here in full to help further readers in case they need a pretty efficient primesieve that is not two million lines of code. This primesieve finds all primes under 100000000 in about 415 ms on the same machine as above. That's a 3x speedup, better than I expected!
#include <vector>
#include <boost/dynamic_bitset.hpp>

// http://vault.embedded.com/98/9802fe2.htm - integer square root
unsigned short isqrt(unsigned long a) {
    unsigned long rem = 0;
    unsigned long root = 0;
    for (short i = 0; i < 16; i++) {
        root <<= 1;
        rem = ((rem << 2) + (a >> 30));
        a <<= 2;
        root++;
        if (root <= rem) {
            rem -= root;
            root++;
        } else root--;
    }
    return static_cast<unsigned short> (root >> 1);
}

// https://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python/3035188#3035188
// https://stackoverflow.com/questions/5293238/porting-optimized-sieve-of-eratosthenes-from-python-to-c/5293492
template <class T>
void primesbelow(T N, std::vector<T> &primes) {
    T i, j, k, l, sievemax, sievemaxroot;
    sievemax = N/3;
    if ((N % 6) == 2) sievemax++;
    sievemaxroot = isqrt(N)/3;
    boost::dynamic_bitset<> sieve(sievemax); // bit i represents the number (3*i + 1) | 1
    sieve.set();
    primes.push_back(2);
    primes.push_back(3);
    for (i = 1; i <= sievemaxroot; i++) {
        if (sieve[i]) {
            k = (3*i + 1) | 1;          // the prime this bit stands for
            l = (4*k - 2*k*(i&1)) / 3;  // offset to the second strike sequence
            for (j = k*k/3; j < sievemax; j += 2*k) {
                sieve[j] = 0;
                if (j + l < sievemax) sieve[j + l] = 0; // guard: don't step past the end
            }
            primes.push_back(k);
        }
    }
    for (i = sievemaxroot + 1; i < sievemax; i++) {
        if (sieve[i]) primes.push_back((3*i+1)|1);
    }
}
I'll try to explain as much as I can. The sieve array has an unusual indexing scheme; it stores a bit for each number that is congruent to 1 or 5 mod 6. Thus, a number 6*k + 1 will be stored in position 2*k and 6*k + 5 will be stored in position 2*k + 1. The 3*i+1|1 operation is the inverse of that: it takes numbers of the form 2*n and converts them into 6*n + 1, and takes 2*n + 1 and converts it into 6*n + 5 (the +1|1 thing converts 0 to 1 and 3 to 5).
The main loop iterates k through all numbers with that property, starting with 5 (when i is 1); i is the corresponding index into sieve for the number k. The first slice update to sieve then clears all bits in the sieve with indexes of the form k*k/3 + 2*m*k (for m a natural number); the corresponding numbers for those indexes start at k^2 and increase by 6*k at each step. The second slice update starts at index k*(k-2*(i&1)+4)/3 (number k * (k+4) for k congruent to 1 mod 6 and k * (k+2) otherwise) and similarly increases the number by 6*k at each step.
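As a quick sanity check of that indexing (a worked example of my own): sieve index 4 = 2*2 represents 6*2 + 1 = 13, and index 5 = 2*2 + 1 represents 6*2 + 5 = 17; going the other way, 3*4+1|1 = 13|1 = 13 and 3*5+1|1 = 16|1 = 17, so the two mappings really are inverses.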
Here's another attempt at an explanation: let candidates be the set of all numbers that are at least 5 and are congruent to either 1 or 5 mod 6. If you multiply two elements in that set, you get another element in the set. Let succ(k) for some k in candidates be the next element (in numerical order) in candidates that is larger than k. In that case, the inner loop of the sieve is basically (using normal indexing for sieve):
for k in candidates:
    for (l = k; ; l += 6) sieve[k * l] = False
    for (l = succ(k); ; l += 6) sieve[k * l] = False
Because of the limitations on which elements are stored in sieve, that is the same as:
for k in candidates:
    for l in candidates where l >= k:
        sieve[k * l] = False
which will remove all multiples of k in candidates (other than k itself) from the sieve at some point (either when the current k was used as l earlier or when it is used as k now).
Piggy-backing onto Howard Hinnant's response: Howard, you don't have to test the numbers in the set of all natural numbers not divisible by 2, 3 or 5 for primality, per se. You simply need to multiply each number in the array (except 1, which self-eliminates) by itself and by every subsequent number in the array. These overlapping products give you all the non-primes in the array up to whatever point you extend the deterministic multiplicative process. Thus the first non-prime in the array will be 7 squared, or 49; the second, 7 times 11, or 77; and so on. A full explanation here: http://www.primesdemystified.com
As an aside, you can "approximate" prime numbers. Call the approximate prime P. Here are a few formulas:
P = 2*k+1 // not divisible by 2
P = 6*k + {1, 5} // not divisible by 2, 3
P = 30*k + {1, 7, 11, 13, 17, 19, 23, 29} // not divisible by 2, 3, 5
The properties of the set of numbers found by these formulas is that P may not be prime, however all primes are in the set P. I.e. if you only test numbers in the set P for prime, you won't miss any.
You can reformulate these formulas to:
P = X*k + {-i, -j, -k, k, j, i}
if that is more convenient for you.
Here is some code that uses this technique with a formula for P not divisible by 2, 3, 5, 7.
This link may represent the extent to which this technique can be practically leveraged.
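To make the wheel idea concrete, here is a minimal sketch of my own (function name and types are illustrative, and it assumes limit > 5) that enumerates the candidates P = 30*k + {1, 7, 11, 13, 17, 19, 23, 29}; every prime above 5 appears in this list, along with some composites that a primality test would still have to weed out:
#include <cstdint>
#include <vector>

std::vector<uint64_t> wheelCandidates(uint64_t limit) {
    static const int offsets[8] = {1, 7, 11, 13, 17, 19, 23, 29};
    std::vector<uint64_t> out{2, 3, 5}; // the wheel primes themselves, listed explicitly
    for (uint64_t base = 0; base < limit; base += 30)
        for (int off : offsets) {
            uint64_t p = base + off;
            if (p > 5 && p < limit) out.push_back(p); // skip 1 and stay below the limit
        }
    return out;
}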