I'm developing a bioinformatics tool and I'm trying to use SIMD to boost its speed.
Given two char arrays of length 16, I need to rapidly count the number of indices at which the strings match. For example, the following two strings, "TTTTTTTTTTTTTTTT" and "AAAAGGGGTTTTCCCC", match at the 9th through 12th positions ("TTTT"), so the output should be 4.
As shown in the following function foo (which works fine but is slow), I packed the characters of seq1 and seq2 into __m128i variables s1 and s2, and used _mm_cmpeq_epi8 to compare every position simultaneously. Then I used popcnt128 (from Fast counting the number of set bits in __m128i register by Marat Dukhan) to add up the number of matching bits.
float foo(char* seq1, char* seq2) {
    __m128i s1, s2, ceq;
    int match;
    s1 = _mm_load_si128((__m128i*)(seq1));   // requires 16-byte aligned input
    s2 = _mm_load_si128((__m128i*)(seq2));
    ceq = _mm_cmpeq_epi8(s1, s2);            // 0xFF in each byte that matches
    match = (popcnt128(ceq) / 8);            // 8 set bits per matching byte
    return match;
}
Although popcnt128 by Marat Dukhan is a lot faster than naïvely adding up every bit in __m128i, popcnt128() is still the biggest bottleneck in the function, accounting for about 80% of its runtime. So I would like to come up with an alternative to popcnt128.
I tried to interpret __m128i ceq as a string and use it as a key for a precomputed look-up table that maps a string to the total number of set bits. If a char array were hashable, I could do something like
union { __m128i ceq; char c_arr[16]; };
match = table[c_arr]; // table = unordered_map
If I try to do something similar for strings (i.e. union{__m128i ceq; string s;};), I get the error message "::()' is implicitly deleted because the default definition would be ill-formed". When I tried other things, I ran into segmentation faults.
Is there any way I can tell the compiler to read __m128i as a string so I can directly use __m128i as a key for unordered_map? I don't see why it shouldn't work, because a string is a contiguous array of chars, which can be naturally represented by __m128i. But I couldn't get it to work and was unable to find any solution online.
You're probably doing this for longer sequences, multiple SIMD vectors of data. In that case, you can accumulate counts in a vector that you only sum up at the end. It's a lot less efficient to popcount every vector separately.
See How to count character occurrences using SIMD: instead of broadcasting a specific character with _mm256_set1_epi8(c), load from the other string. Do everything else the same, including
counts = _mm_sub_epi8(counts, _mm_cmpeq_epi8(s1, s2));
in the inner loop, and the loop unrolling. (A compare result is an integer 0 / -1, so subtracting it adds 0 or 1 to another vector.) This is at risk of overflow after 256 iterations, so do at most 255. That linked question uses AVX2, but the __m128i versions of those intrinsics only require SSE2. (Of course, AVX2 would let you get twice as much work done per vector instruction.)
Horizontal sum the byte counters in the outer loop, using _mm_sad_epu8(v, _mm_setzero_si128()); and then accumulating into another vector of counts. Again, this is all in the code in the linked Q&A, so just copy/paste that and add a load from the other string into the inner loop, instead of using a broadcast constant.
Can counting byte matches between two strings be optimized using SIMD? shows basically the same thing for 128-bit vectors, including a version at the bottom that only does SAD hsums after an inner loop. It's written for two input pointers already, rather than char and string.
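To make that concrete, here is a minimal sketch of the multi-vector loop (SSE2 only; the function name and length handling are illustrative, and it assumes len is a multiple of 16):
#include <immintrin.h>
#include <stdint.h>

// Hedged sketch: count matching bytes over len bytes, accumulating compare results
// in a vector of byte counters and widening with psadbw at most every 255 iterations.
uint64_t count_matches(const char *seq1, const char *seq2, size_t len)
{
    __m128i total = _mm_setzero_si128();              // two 64-bit accumulators
    size_t i = 0;
    while (i < len) {
        __m128i counts = _mm_setzero_si128();         // per-byte counters, 0..255
        size_t chunk = (len - i < 255 * 16) ? (len - i) : (255 * 16);
        for (size_t end = i + chunk; i < end; i += 16) {
            __m128i s1 = _mm_loadu_si128((const __m128i*)(seq1 + i));
            __m128i s2 = _mm_loadu_si128((const __m128i*)(seq2 + i));
            // compare gives 0 / -1 per byte, so subtracting adds 0 or 1
            counts = _mm_sub_epi8(counts, _mm_cmpeq_epi8(s1, s2));
        }
        // horizontal sum of the 16 byte counters into two qwords, then accumulate
        total = _mm_add_epi64(total, _mm_sad_epu8(counts, _mm_setzero_si128()));
    }
    return (uint64_t)_mm_cvtsi128_si64(total)
         + (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(total, total));
}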
For a single vector:
You don't need to count all the bits in your __m128i; take advantage of the fact that all 8 bits in each byte are the same by extracting 1 bit per element to a scalar integer. (x86 SIMD can do that efficiently, unlike some other SIMD ISAs)
count = __builtin_popcount(_mm_movemask_epi8(cmp_result));
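Or, as a complete single-vector sketch (SSE2 plus hardware popcnt; the function name is just for illustration):
#include <immintrin.h>

int count_matches16(const char *seq1, const char *seq2)
{
    __m128i s1  = _mm_loadu_si128((const __m128i*)seq1);  // unaligned load, to be safe
    __m128i s2  = _mm_loadu_si128((const __m128i*)seq2);
    __m128i ceq = _mm_cmpeq_epi8(s1, s2);                 // 0xFF in each matching byte
    // one bit per byte -> scalar popcount of a 16-bit mask
    return __builtin_popcount((unsigned)_mm_movemask_epi8(ceq));
}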
Another possible option is psadbw against 0 (hsum of bytes on the compare result), but that needs a final hsum step of qword halves, so that's going to be worse than HW popcnt. But if you can't compile with -mpopcnt then it's worth considering if you need baseline x86-64 with just SSE2. (Also you need to negate before psadbw, or scale the sum down by 1/255...)
(Note that the psadbw strategy is basically what I described in the first section of the answer, but for only a single vector, not taking advantage of the ability to cheaply add multiple counts into one vector accumulator.)
If you really need the result as a float, then that makes a psadbw strategy less bad: you can keep the value in SIMD vectors the whole time, using _mm_cvtepi32_ps to do packed conversion on the horizontal sum result (even cheaper than cvtsi2ss int->float scalar conversion). _mm_cvtss_f32 is free; a scalar float is just the low element of an XMM register.
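For example, a sketch of that all-SIMD path for a single compare result ceq (SSE2 only; the negate turns the 0/-1 bytes into 0/1 before psadbw):
__m128i ones  = _mm_sub_epi8(_mm_setzero_si128(), ceq);   // 0 / -1  ->  0 / 1 per byte
__m128i sums  = _mm_sad_epu8(ones, _mm_setzero_si128());  // two 64-bit partial sums
__m128i total = _mm_add_epi32(sums, _mm_shuffle_epi32(sums, _MM_SHUFFLE(1, 0, 3, 2)));
float match   = _mm_cvtss_f32(_mm_cvtepi32_ps(total));    // low element as a float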
But seriously, do you really need an integer count as a float now? Can't you at least wait until you have the sum across all vectors, or keep it integer?
-mpopcnt is implied by gcc -msse4.2, or by -march=native on anything less than about 10 years old. Core 2 lacked hardware popcnt; Nehalem and later Intel CPUs have it.
Background
I've recently been taking some old code (~1998) and rewriting some of it to improve performance. Previously, in the basic data structures for a state, I stored elements in several arrays; now I'm using raw bits (for the cases that require less than 64 bits). That is, before I had an array of b elements, and now I have b bits set in a single 64-bit integer that indicate whether that value is part of my state.
Using intrinsics like _pext_u64 and _pdep_u64 I've managed to get all operations 5-10x faster. I am working on the last operation, which has to do with computing a perfect hash function.
The exact details of the hash function aren't too important, but it boils down to computing binomial coefficients (n choose k, i.e. n!/((n-k)!k!)) for various n and k. My current code uses a large lookup table for this, which is probably hard to speed up significantly on its own (except for possible cache misses in the table, which I haven't measured).
But, I was thinking that with SIMD instructions I might be able to directly compute these for several states in parallel, and thus see an overall performance boost.
Some constraints:
There are always exactly b bits set in each 64-bit state (representing small numbers).
The k value in the binomial coefficients is related to b and changes uniformly in the calculation. These values are small (most of the time <= 5).
The final hash will be < 15 million (easily fits in 32 bits).
So, I can fairly easily write out the math for doing this in parallel and for keeping all operations as integer multiple/divide without remainders while keeping within 32 bits. The overall flow is:
Extract the bits into values suitable for SIMD instructions.
Perform the n choose k computation in a way to avoid overflow.
Extract out the final hash value from each entry
But, I haven't written SIMD code before, so I'm still getting up to speed on all the functions available and their caveats/efficiencies.
Example:
Previously I would have had my data in an array, supposing there are always 5 elements:
[3 7 19 31 38]
Now I'm using a single 64-bit value for this:
0x4080080088
This makes many other operations very efficient. For the perfect hash I need to compute something like this efficiently (using c for choose):
(50c5)-(38c5) + (37c4)-(31c4) + (30c3)-(19c3) + ...
But, in practice I have a bunch of these to compute, just with slightly different values:
(50c5)-(Xc5) + ((X-1)c4)-(Yc4) + ((Y-1)c3)-(Zc3) + ...
All the X/Y/Z... will be different but the form of the calculation is identical for each.
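For reference, here is a scalar sketch of that formula, assuming a precomputed choose[n][k] table and the 5 set-bit values stored high-to-low in elem[] (both names are illustrative):
int64_t rank = 0;
int prev = 50;                         // the "n" of the first term, e.g. 50c5
int k = 5;                             // one term per remaining element
for (int i = 0; i < 5; i++) {
    rank += choose[prev][k] - choose[elem[i]][k];
    prev = elem[i] - 1;                // the next term starts at (elem - 1) choose (k - 1)
    k--;
}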
Questions:
Is my intuition on gaining efficiency by converting to SIMD operations reasonable? (Some sources suggest "no", but that's the problem of computing a single coefficient, not doing several in parallel.)
Is there something more efficient than repeated _tzcnt_u64 calls for extracting bits into the data structures for SIMD operations? (For instance, I could temporarily break my 64-bit state representation into 32-bit chunks if it would help, but then I wouldn't be guaranteed to have the same number of bits set in each element.)
What are the best intrinsics for computing several sequential multiply/divide operations for the binomial coefficients when I know there won't be overflow. (When I look through the Intel references I have trouble interpreting the naming quickly when going through all the variants - it isn't clear that what I want is available.)
If directly computing the coefficients is unlikely to be efficient, can SIMD instructions be used for parallel lookups into my previous lookup table of coefficients?
(I apologize for putting several questions together, but given the specific context, I thought it would be better to put them together as one.)
Here is one possible solution that does the computation from a lookup table using one state at a time. It's probably going to be more efficient to do this in parallel over several states instead of using a single state. Note: This is hard-coded for the fixed case of getting combinations of 6 elements.
int64_t GetPerfectHash2(State &s)
{
    // 6 values will be used
    __m256i offsetsm1 = _mm256_setr_epi32(6*boardSize-1, 5*boardSize-1,
                                          4*boardSize-1, 3*boardSize-1,
                                          2*boardSize-1, 1*boardSize-1, 0, 0);
    __m256i offsetsm2 = _mm256_setr_epi32(6*boardSize-2, 5*boardSize-2,
                                          4*boardSize-2, 3*boardSize-2,
                                          2*boardSize-2, 1*boardSize-2, 0, 0);
    int32_t index[9];
    uint64_t value = _pext_u64(s.index2, ~s.index1);
    index[0] = boardSize-numItemsSet+1;
    for (int x = 1; x < 7; x++)
    {
        index[x] = boardSize-numItemsSet-_tzcnt_u64(value);
        value = _blsr_u64(value);
    }
    index[8] = index[7] = 0;
    // Load values and get index in table
    __m256i firstLookup = _mm256_add_epi32(_mm256_loadu_si256((const __m256i*)&index[0]), offsetsm2);
    __m256i secondLookup = _mm256_add_epi32(_mm256_loadu_si256((const __m256i*)&index[1]), offsetsm1);
    // Lookup in table
    __m256i values1 = _mm256_i32gather_epi32(combinations, firstLookup, 4);
    __m256i values2 = _mm256_i32gather_epi32(combinations, secondLookup, 4);
    // Subtract the terms
    __m256i finalValues = _mm256_sub_epi32(values1, values2);
    _mm256_storeu_si256((__m256i*)index, finalValues);
    // Extract out final sum
    int64_t result = 0;
    for (int x = 0; x < 6; x++)
    {
        result += index[x];
    }
    return result;
}
Note that I actually have two similar cases. In the first case I don't need the _pext_u64 and this code is ~3x slower than my existing code. In the second case I need it, and it is 25% faster.
I think the SIMD shuffle function is not a real shuffle: for the int32_t case, the left and right parts (the two 128-bit lanes) are shuffled separately.
I want a real shuffle function as following:
Assume we have a __m256i and we want to shuffle 8 int32_t values.
__m256i to_shuffle = _mm256_set_epi32(17, 18, 20, 21, 25, 26, 29, 31);
const int imm8 = 0b10101100;
__m256i shuffled = _mm256_shuffle(to_shuffle, imm8); // hypothetical intrinsic
I hope that shuffled = {17, 20, 25, 26, -, -, -, -}, where - represents a don't-care value that can be anything.
That is, I want each int whose position has a 1 bit in imm8 to be packed into shuffled.
(In our case: 17, 20, 25, 26 are sitting at the positions with a 1 in the imm8.)
Is such a function offered by Intel?
How could such a function be implemented efficiently?
EDIT: the - values can be ignored. Only the ints at positions with a 1 bit are needed.
(I'm assuming you got your immediate backwards (selector for 17 should be the low bit, not high bit) and your vectors are actually written in low-element-first order).
How could such a function be implemented efficiently?
In this case with AVX2 vpermd ( _mm256_permutevar8x32_epi32 ). It needs a control vector not an immediate, to hold 8 selectors for the 8 output elements. So you'd have to load a constant and use that as the control operand.
Since you only care about the low half of your output vector, your vector constant can be only __m128i, saving space. vmovdqa xmm, [mem] zero-extends into the corresponding YMM vector. It's probably inconvenient to write this in C with intrinsics but _mm256_castsi128_si256 should work. Or even _mm256_broadcastsi128_si256 because a broadcast-load would be just as cheap. Still, some compilers might pessimize it to an actual 32-byte constant in memory by doing constant-propagation. If you know assembly, compiler output is frequently disappointing.
If you want to take an actual integer bitmap in your source, you could probably use C++ templates to convert that at compile time into the right vector constant. Agner Fog's Vector Class Library (now Apache-licensed, previously GPL) has some related things like that, turning integer constants into a single blend or sequence of blend instructions depending on the constant and what target ISA is supported, using C++ templates. But its shuffle template takes a list of indices, not a bitmap.
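For the example in the question (keeping elements 0, 2, 4 and 5 in low-element-first order), a sketch of that constant-control-vector approach might look like this:
#include <immintrin.h>

// Hedged sketch: pack elements 0, 2, 4, 5 of to_shuffle into the low 4 slots.
// The upper 4 selectors are don't-care for this use case; the function name is illustrative.
__m256i compress_17_20_25_26(__m256i to_shuffle)
{
    const __m256i ctrl = _mm256_setr_epi32(0, 2, 4, 5, 0, 0, 0, 0);
    return _mm256_permutevar8x32_epi32(to_shuffle, ctrl);   // vpermd
}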
But I think you're trying to ask about why / how x86 shuffles are designed the way they are.
Is such a function offered by Intel?
Yes, in hardware with AVX512F (plus AVX512VL to use it on 256-bit vectors).
You're looking for vpcompressd, the vector-element equivalent of BMI2 pext. (But it takes the control operand as a mask register value, not an immediate constant.) The intrinsic is
__m256i _mm256_maskz_compress_epi32( __mmask8 c, __m256i a);
It's also available in a version that merges into the bottom of an existing vector instead of zeroing the top elements.
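For example, a usage sketch for the question's case (assuming low-element-first order, so the wanted elements are 0, 2, 4 and 5):
__m256i to_shuffle = _mm256_setr_epi32(17, 18, 20, 21, 25, 26, 29, 31);
__m256i packed = _mm256_maskz_compress_epi32(0b00110101, to_shuffle); // vpcompressd
// packed = {17, 20, 25, 26, 0, 0, 0, 0}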
As an immediate shuffle, no.
All x86 shuffles use a control operand that has indices into the source, not a bitmap of which elements to keep. (Except vpcompressd/q and vpexpandd/q). Or they use an implicit control, like _mm256_unpacklo_epi32 for example which interleaves 32-bit elements from 2 inputs (in-lane in the low and high halves).
If you're going to provide a shuffle with a control operand at all, it's usually most useful if any element can end up at any position. So the output doesn't have to be in the same order as the input. Your compress shuffle doesn't have that property.
Also, having a source index for each output element is what shuffle hardware naturally wants. My understanding is that each output element is fed by its own MUX (multiplexer), where the MUX takes N input elements and one binary selector to select which one to output. (And is as wide as the element width of course.) See Where is VPERMB in AVX2? for more discussion of building muxers.
Having the control operand in some format other than a list of selectors would require preprocessing before it could be fed to shuffle hardware.
For an immediate, the format is either 2x1-bit or 4x2-bit fields, or a byte-shift count for _mm_bslli_si128 and _mm_alignr_epi8. Or index + zeroing bitmask for insertps. There are no SIMD instructions with an immediate wider than 8 bits. Presumably this keeps the hardware decoders simple.
(Or 1x1-bit for vextractf128 xmm, ymm, 0 or 1, which in hindsight would be better with no immediate at all. Using it with 0 is always worse than vmovdqa xmm, xmm. Although AVX512 does use the same opcode for vextractf32x4 with an EVEX prefix for the 1x2-bit immediate, so maybe this had some benefit for decoder complexity. Anyway, there are no immediate shuffles with selector fields wider than 2 bits because 8x 3-bit would be 24 bits.)
For wider 4x2 in-lane shuffles like _mm256_shuffle_ps (vshufps ymm, ymm, ymm, imm8), the same 4x2-bit selector pattern is reused for both lanes. For wider 2x1 in-lane shuffles like _mm256_shuffle_pd (vshufpd ymm, ymm, ymm, imm8), we get 4x 1-bit immediate fields that still select in-lane.
There are lane-crossing shuffles with 4x 2-bit selectors, vpermq and vpermpd. Those work exactly like pshufd xmm (_mm_shuffle_epi32) but with 4x qword elements across a 256-bit register instead of 4x dword elements across a 128-bit register.
As far as narrowing / only caring about part of the output:
A normal immediate would need 4x 3-bit selectors to each index one of the 8x 32-bit source elements. But much more likely 8x 3-bit selectors = 24 bits, because why design a shuffle instruction that can only ever write half a half-width output? (Other than vextractf128 xmm, ymm, 1).
Generally, the paradigm for more-granular shuffles is to take a control vector, rather than some funky immediate encoding.
AVX512 did add some narrowing shuffles like VPMOVDB xmm/[mem], x/y/zmm that truncate (or signed/unsigned saturate) 32-bit elements down to 8-bit. (And all other combinations of sizes are available).
They're interesting because they're available with a memory destination. Perhaps this is motivated by some CPUs (like Xeon Phi KNL / KNM) not having AVX512VL, so they can only use AVX512 instructions with ZMM vectors. Still, they have AVX1 and 2 so you could compress into an xmm reg and use a normal VEX-encoded store. But it does allow doing a narrow byte-masked store with AVX512F, which would only be possible with AVX512BW if you had the packed data in an XMM register.
There are some 2-input shuffles like shufps that treat the low and high half of the output separately, e.g. the low half of the output can select from elements of the first source, the high half of the output can select from elements of the second source register.
I'm new to AVX programming. I have a register which needs to be shuffled. I want to shuffle several bytes from a 256-bit register, R1, to an empty register R2. I want to define a mask which tells the shuffle operation which byte from the old register(R1) should be copied at which place in the new register.
The mask should look like this(Src:Byte Pos in R1, Target:Byte Pos in R2):
{(0,0),(1,1),(1,4),(2,5),...}
This means several bytes are copied twice.
I'm not 100% sure which function I should use for this. I tried a bit with these two AVX functions, the second just uses 2 lanes.
__m256 _mm256_permute_ps (__m256 a, int imm8)
__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm8)
I'm totally confused about the Shuffle Mask in imm8 and how to design it so that it would work as described above.
I had a look at these slides (page 26) where _MM_SHUFFLE is described, but I can't find a solution to my problem.
Are there any tutorials on how to design such a mask? Or example functions for the two methods to understand them in depth?
Thanks in advance for hints
TL:DR: you probably either need multiple shuffles to handle lane-crossing, or if your pattern continues exactly like that you can use _mm256_cvtepu16_epi32 (vpmovzxwd) and then _mm256_blend_epi16.
For x86 shuffles (like most SIMD instruction-sets, I think), the destination position is implicit. A shuffle-control constant just has source indices in destination order, whether it's an imm8 that gets compiled+assembled right into an asm instruction or whether it's a vector with an index in each element.
Each destination position reads exactly one source position, but the same source position can be read more than once. Each destination element gets a value from the shuffle source.
See Convert _mm_shuffle_epi32 to C expression for the permutation? for a plain-C version of dst = _mm_shuffle_epi32(src, _MM_SHUFFLE(d,c,b,a)), showing how the control byte is used.
(For pshufb / _mm_shuffle_epi8, an element with the high bit set zeros that destination position instead of reading any source element, but other x86 shuffles ignore all the high bits in shuffle-control vectors.)
Without AVX512 merge-masking, there are no shuffles that also blend into a destination. There are some two-source shuffles like _mm256_shuffle_ps (vshufps) which can shuffle together elements from two sources to produce a single result vector. If you wanted to leave some destination elements unwritten, you'll probably have to shuffle and then blend, e.g. with _mm256_blendv_epi8, or if you can use blend with 16-bit granularity you can use a more efficient immediate blend _mm256_blend_epi16, or even better _mm256_blend_epi32 (AVX2 vpblendd is as cheap as _mm256_and_si256 on Intel CPUs, and is the best choice if you do need to blend at all, if it can get the job done; see http://agner.org/optimize/)
For your problem (without AVX512VBMI vpermb in Cannonlake), you can't shuffle single bytes from the low 16 "lane" into the high 16 "lane" of a __m256i vector with a single operation.
AVX shuffles are not like a full 256-bit SIMD, they're more like two 128-bit operations in parallel. The only exceptions are some AVX2 lane-crossing shuffles with 32-bit granularity or larger, like vpermd (_mm256_permutevar8x32_epi32). And also the AVX2 versions of pmovzx / pmovsx, e.g. pmovzxbq does zero-extend the low 4 bytes of an XMM register into the 4 qwords of a YMM register, rather than the low 2 bytes of each half of a YMM register. This makes it much more useful with a memory source operand.
But anyway, the AVX2 version of pshufb (_mm256_shuffle_epi8) does two separate 16x16 byte shuffles in the two lanes of a 256-bit vector.
You're probably going to want something like this:
// Intrinsics have different types for integer, float and double vectors
// the asm uses the same registers either way
__m256i shuffle_and_blend(__m256i dst, __m256i src)
{
    // setr takes elements in low to high order, like a C array init
    // unlike the standard Intel notation where the high element is first
    const __m256i shuffle_control = _mm256_setr_epi8(
        0, 1, -1, -1, 1, 2, ...);
    // {(0,0), (1,1), (zero), (1,4), (2,5), ...} in your src,dst notation
    // Use -1 or 0x80 or anything with the high bit set
    // for positions you want to leave unmodified in dst.
    // blendv uses the high bit as a blend control, so the same vector can do double duty.
    // Maybe need some lane-crossing stuff depending on the pattern of your shuffle.
    __m256i shuffled = _mm256_shuffle_epi8(src, shuffle_control);
    // or if the pattern continues, and you're just leaving 2 bytes between every 2-byte group:
    shuffled = _mm256_cvtepu16_epi32(src); // if src is a __m128i
    __m256i blended = _mm256_blendv_epi8(shuffled, dst, shuffle_control);
    // blend dst elements we want to keep into the shuffled src result.
    return blended;
}
Note that the pshufb numbering restarts from 0 for the 2nd 16 bytes. The two halves of the __m256i can be different, but they can't read elements from the other half. If you need positions in the high lane to get bytes from the low lane, you'll need more shuffling + blending (e.g. including vinserti128 or vperm2i128, or maybe a vpermd lane-crossing dword shuffle) to get all the bytes you need into one 16-byte group in some order.
(Actually _mm256_shuffle_epi8 (PSHUFB) ignores bits 4..6 in a shuffle index, so writing 17 is the same as 1, but very misleading. It's effectively doing a %16, as long as the high bit isn't set. If the high bit is set in the shuffle-control vector, it zeros that element. We don't need that functionality here; _mm256_blendv_epi8 doesn't care about the old value of the element it's replacing)
Anyway, this simple 2-instruction example only works if the pattern doesn't continue. If you want help designing your real shuffles, you'll have to ask a more specific question.
And BTW, I notice that your blend pattern used 2 new bytes then skipped 2. If that continues, you could use vpblendw _mm256_blend_epi16 instead of blendv, because that instruction runs in only 1 uop instead of 2 on Intel CPUs. It would also allow you to use AVX512BW vpermw, a 16-bit shuffle available in current Skylake-AVX512 CPUs, instead of the probably-even-slower AVX512VBMI vpermb.
Or actually, it would maybe let you use vpmovzxwd (_mm256_cvtepu16_epi32) to zero-extend 16-bit elements to 32-bit, as a lane-crossing shuffle. Then blend with dst.
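A sketch of that variant, assuming the 2-new-bytes-then-2-kept pattern really does continue and src is a __m128i holding the 16 source bytes:
// Hedged sketch: zero-extend each 16-bit group of src into a 32-bit slot (lane-crossing),
// then keep the odd 16-bit words of dst with an immediate blend (vpblendw).
__m256i widened = _mm256_cvtepu16_epi32(src);                    // vpmovzxwd
__m256i result  = _mm256_blend_epi16(widened, dst, 0b10101010);  // keep dst's odd words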
A piece of C++ code determines the occurrences of zero and keeps a binary flag variable for each number that is checked. The value of the flag toggles between 0 and 1 each time a zero is encountered in a one-dimensional array.
I am attempting to use SSE to speed it up, but I am unsure of how to go about this. Evaluating the individual fields of __m128i is inefficient, I've read.
The code in C++ is:
int flag = 0;
int var_num2[1000];
// var[] is the input array (its element type isn't specified here)
for (int i = 0; i < 1000; i++)
{
    if (var[i] == 0)
    {
        var_num2[i] = flag;
        flag = !flag; // toggle value upon encountering a 0
    }
}
How should I go about this using SSE intrinsics?
You'd have to recognize the problem, but this is a variation of a well-known problem. I'll first give a theoretical description.
Introduce a temporary array not_var[] which contains 1 if var contains 0 and 0 otherwise.
Introduce a temporary array not_var_sum[] which holds the partial sum of not_var.
var_num2 is now the LSB of not_var_sum[]
The first and third operations are trivially parallelizable. Parallelizing a partial sum is only a bit harder.
In a practical implementation, you wouldn't construct not_var[], and you'd write the LSB directly to var_num2 in all iterations of step 2. This is valid because you can discard the higher bits. Keeping just the LSB is equivalent to taking the result modulo 2, and (a+b)%2 == ((a%2) + (b%2))%2.
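A scalar sketch of that reformulation (an exclusive prefix sum kept mod 2, before any vectorization):
int zeros_before = 0;                    // prefix sum of not_var[], only the LSB kept
for (int i = 0; i < 1000; i++) {
    int not_var_i = (var[i] == 0);       // step 1: not_var[i]
    if (not_var_i)
        var_num2[i] = zeros_before;      // step 3: LSB of the prefix sum so far
    zeros_before ^= not_var_i;           // step 2: adding mod 2 is just XOR
}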
What type are the elements of var[]? int? Or char? Are zeroes frequent?
A SIMD prefix sum (aka partial sum) is possible (with log2(vector_width) work per element, e.g. 2 shuffles and 2 adds for a vector of 4 floats), but the conditional store based on the result is the other major problem. (Your array of 1000 elements is probably too small for multi-threading to be profitable.)
An integer prefix-sum is easier to do efficiently, and the lower latency of integer ops helps. NOT is just adding without carry, i.e. XOR, so use _mm_xor_si128 instead of _mm_add_ps. (You'd be using this on the integer all-zero/all-one compare result vector from _mm_cmpeq_epi32, or epi8 or whatever, depending on the element size of var[]. You didn't specify, but different choices of strategy are probably optimal for different sizes.)
But, just having a SIMD prefix sum actually barely helps: you'd still have to loop through and figure out where to store and where to leave unmodified.
I think your best bet is to generate a list of indices where you need to store, and then
for (size_t j = 0 ; j < scatter_count ; j+=2) {
var_num2[ scatter_element[j+0] ] = 0;
var_num2[ scatter_element[j+1] ] = 1;
}
You could generate the whole list of indices up-front, or you could work in small batches to overlap the search work with the store work.
The prefix-sum part of the problem is handled by alternately storing 0 and 1 in an unrolled loop. The real trick is avoiding branch mispredicts, and generating the indices efficiently.
To generate scatter_element[], you've transformed the problem into left-packing (filtering) an (implicit) array of indices based on the corresponding _mm_cmpeq_epi32( var[i..i+3], _mm_setzero_si128() ). To generate the indices you're filtering, start with a vector of [0,1,2,3] and add [4,4,4,4] to it (_mm_add_epi32). I'm assuming the element size of var[] is 32 bits. If you have smaller elements, this requires unpacking.
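A sketch of that left-packing step for one 4-element chunk (SSSE3 + popcnt; pack_shuffles[16] is a hypothetical precomputed table of pshufb controls, one per 4-bit mask, that packs the selected dwords to the front):
__m128i idx    = _mm_setr_epi32(i, i + 1, i + 2, i + 3);            // the implicit indices
__m128i v      = _mm_loadu_si128((const __m128i*)&var[i]);
__m128i zcmp   = _mm_cmpeq_epi32(v, _mm_setzero_si128());
int     mask   = _mm_movemask_ps(_mm_castsi128_ps(zcmp));           // 4-bit zero-position mask
__m128i packed = _mm_shuffle_epi8(idx, pack_shuffles[mask]);        // left-pack matching indices
// store all 16 bytes, but only advance by the number of real indices
_mm_storeu_si128((__m128i*)&scatter_element[scatter_count], packed);
scatter_count += _mm_popcnt_u32(mask);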
BTW, AVX512 has scatter instructions which you could use here, otherwise doing the store part with scalar code is your best bet. (But beware of Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake when just storing without loading.)
To overlap the left-packing with the storing, I think you want to left-pack until you have maybe 64 indices in a buffer. Then leave that loop and run another loop that left-packs indices and consumes indices, only stopping if your circular buffer is full (then just store) or empty (then just left-pack). This lets you overlap the vector compare / lookup-table work with the scatter-store work, but without too much unpredictable branching.
If zeros are very frequent, and var_num2[] elements are 32 or 64 bits, and you have AVX or AVX2 available, you could consider doing a standard prefix sum and using AVX masked stores, e.g. vpmaskmovd. Don't use SSE maskmovdqu, though: it has an NT hint, so it bypasses and evicts data from cache, and is quite slow.
Also, because your prefix sum is mod 2, i.e. boolean, you could use a lookup table based on the packed-compare result mask. Instead of horizontal ops with shuffles, use the 4-bit movmskps result of a compare + a 5th bit for the initial state as an index to a lookup table of 32 vectors (assuming 32-bit element size for var[]).
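A sketch of that lookup-table idea for one chunk of four 32-bit var[] elements (value_lut[32] and flag_lut[32] are hypothetical precomputed tables: the index is (flag << 4) | mask, value_lut holds the four flag values to store, and flag_lut the parity after the chunk):
__m128i v    = _mm_loadu_si128((const __m128i*)&var[i]);
__m128i zcmp = _mm_cmpeq_epi32(v, _mm_setzero_si128());           // all-ones where var[i] == 0
int mask     = _mm_movemask_ps(_mm_castsi128_ps(zcmp));           // 4-bit mask of zero positions
int lut_idx  = (flag << 4) | mask;                                // 5-bit LUT index
_mm_maskstore_epi32(&var_num2[i], zcmp, value_lut[lut_idx]);      // AVX masked store (vpmaskmovd)
flag         = flag_lut[lut_idx];                                 // flag parity after this chunk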
Is there some reasonably fast code out there which can help me quickly search a large bitmap (a few megabytes) for runs of contiguous zero or one bits?
By "reasonably fast" I mean something that can take advantage of the machine word size and compare entire words at once, instead of doing bit-by-bit analysis which is horrifically slow (such as one does with vector<bool>).
It's very useful for e.g. searching the bitmap of a volume for free space (for defragmentation, etc.).
Windows has an RTL_BITMAP data structure one can use along with its APIs.
But I needed the code for this sometime ago, and so I wrote it here (warning, it's a little ugly):
https://gist.github.com/3206128
I have only partially tested it, so it might still have bugs (especially on reverse). But a recent version (only slightly different from this one) seemed to be usable for me, so it's worth a try.
The fundamental operation for the entire thing is being able to -- quickly -- find the length of a run of bits:
long long GetRunLength(
const void *const pBitmap, unsigned long long nBitmapBits,
long long startInclusive, long long endExclusive,
const bool reverse, /*out*/ bool *pBit);
Everything else should be easy to build upon this, given its versatility.
I tried to include some SSE code, but it didn't noticeably improve the performance. However, in general, the code is many times faster than doing bit-by-bit analysis, so I think it might be useful.
It should be easy to test if you can get a hold of vector<bool>'s buffer somehow -- and if you're on Visual C++, then there's a function I included which does that for you. If you find bugs, feel free to let me know.
I can't figure out how to work directly on memory words, so I've made up a quick solution which works on bytes; for convenience, let's sketch the algorithm for counting contiguous ones:
Construct two tables of size 256 where, for each number between 0 and 255, you write the number of consecutive 1s at the beginning and at the end of the byte. For example, for the number 167 (10100111 in binary), put 1 in the first table and 3 in the second table. Let's call the first table BBeg and the second table BEnd. Then, for each byte b, there are two cases: if it is 255, add 8 to the length of your current contiguous run of ones, and you stay in a region of ones. Otherwise, you end a region with BBeg[b] bits and begin a new one with BEnd[b] bits.
Depending on what information you want, you can adapt this algorithm (this is a reason why I don't put here any code, I don't know what output you want).
A flaw is that it does not count (small) contiguous sets of ones inside one byte...
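Still, a sketch of that byte-by-byte scan might look like this (the output part is left open, as noted; BBeg/BEnd are the two tables described above):
unsigned run = 0;                        // length of the run currently open at a byte boundary
for (size_t i = 0; i < n; i++) {
    unsigned char b = bytes[i];
    if (b == 0xFF) {
        run += 8;                        // the run continues straight through this byte
    } else {
        run += BBeg[b];                  // the open run ends after the leading 1s of this byte
        // ... record `run` here, depending on what output you want ...
        run = BEnd[b];                   // a new run starts with the trailing 1s of this byte
    }
}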
Beside this algorithm, a friend tells me that if it is for disk compression, just look for bytes different from 0 (empty disk area) and 255 (full disk area). It is a quick heuristic to build a map of what blocks you have to compress. Maybe it is beyond the scope of this topic ...
Sounds like this might be useful:
http://www.aggregate.org/MAGIC/#Population%20Count%20%28Ones%20Count%29
and
http://www.aggregate.org/MAGIC/#Leading%20Zero%20Count
You don't say whether you want to do some sort of RLE or simply count in-byte zero and one bits (like 0b1001 should return 1x1 2x0 1x1).
A lookup table plus a SWAR algorithm for a fast check might give you that information easily.
A bit like this:
uint8_t lut[0x10000] = { /* see below */ };

for (uint32_t *word = words; word < words + bitmapSize; word++) {
    if (*word == 0 || *word == (uint32_t)-1) // Fast bailout
    {
        // Do what you want if all 0 or all 1
        continue;
    }
    uint8_t hiVal = lut[*word >> 16], loVal = lut[*word & 0xFFFF];
    // Do what you want with hiVal and loVal
}
The LUT will have to be constructed depending on your intended algorithm. If you want to count the number of contiguous 0s and 1s in the word, you'd build it like this:
for (size_t i = 0; i < sizeof(lut); i++)
    lut[i] = countContiguousZero(i); // Or countContiguousOne(i)
// The implementation of countContiguousZero can be slow, you don't care.
// It should return the largest number of contiguous zeros (0 to 15) in the 4 low bits
// of the byte, and might return the position of the run in the 4 high bits of the byte.
// Since you've already dismissed word == 0, you don't need the 16-contiguous-zero case.