How to bitwise operate on memory block (C++) - c++

Is there a better (faster/more efficient) way to perform a bitwise operation on a large memory block than using a for loop? After looking into my options I noticed that the standard library has std::bitset, and was also wondering if it would be better (or even possible) to convert a large region of memory into a bitset without changing its values, then perform the operations, and then switch its type back to normal?
Edit / update: I think a union might apply here, such that the memory block is viewed as an array of int or something similar and then manipulated as one large bitset. Operations seem to be able to be done over the entire set, based on what is said here: http://www.cplusplus.com/reference/bitset/bitset/operators/ .

In general, there is no magical way faster than a for loop. However, you can make it easier for the compiler to optimize the loop by keeping a few things in mind:
Load and process the largest available integer type at a time. However, you need to be careful if your buffer has a length which does not divide evenly by the size of that integer type.
If possible, operate on multiple values in one loop iteration - this should make vectorization much simpler for the compiler. Again, you need to be careful about the buffer length.
If the loop is run many times over short buffers, use a loop index that counts downwards to zero rather than upwards, and subtract it from the array length - this makes it easier for the CPU's branch predictor to figure out what's going on.
You can use explicit vector extensions provided by the compiler, but this will make your code less portable.
Ultimately, you can write the loop in assembly and use vector instructions provided by your CPU, but this is completely unportable.
[edit] Additionally, you can use OpenMP or a similar API to divide the loop between multiple threads, but this will only cause an improvement if you are performing the operation on a very large amount of memory.
C99 example of XORing memory with a constant byte, assuming long long is 128 bits, the start of the buffer is aligned to 16 bytes, and ignoring point 3. Bitwise operations on two memory buffers are very similar.
size_t len = ...;
char *buffer = ...;
size_t const loads_per_i = 4;
size_t iters = len / sizeof(long long) / loads_per_i;
long long *ptr = (long long *) buffer;
long long xorvalue = 0x5e5e5e5e5e5e5e5e5e5e5e5e5e5e5e5eLL;
// run in multiple threads if there are more than 4 MB to xor
#pragma omp parallel for if(iters > 65536)
for (size_t i = 0; i < iters; ++i) {
    size_t j = loads_per_i * i;
    ptr[j  ] ^= xorvalue;
    ptr[j+1] ^= xorvalue;
    ptr[j+2] ^= xorvalue;
    ptr[j+3] ^= xorvalue;
}
// finish long longs which don't align to 4
for (size_t i = iters * loads_per_i; i < len / sizeof(long long); ++i) {
    ptr[i] ^= xorvalue;
}
// finish bytes which don't fill a whole long long
for (size_t i = (len / sizeof(long long)) * sizeof(long long); i < len; ++i) {
    buffer[i] ^= xorvalue;
}
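For a more portable take on the same idea (my own sketch, not part of the answer above, and untested): process one uint64_t at a time and let memcpy handle alignment and aliasing; with optimizations enabled, GCC and Clang typically turn this into the same kind of vectorized loop.
#include <cstdint>
#include <cstring>
#include <cstddef>

// Portable sketch: XOR a buffer with a constant byte, one 64-bit word at a time.
// memcpy avoids alignment/aliasing problems; compilers turn it into plain loads/stores.
void xor_buffer(unsigned char *buffer, std::size_t len, unsigned char value)
{
    const std::uint64_t pattern = 0x0101010101010101ULL * value; // value repeated in every byte
    std::size_t i = 0;
    for (; i + sizeof(std::uint64_t) <= len; i += sizeof(std::uint64_t)) {
        std::uint64_t word;
        std::memcpy(&word, buffer + i, sizeof(word));
        word ^= pattern;
        std::memcpy(buffer + i, &word, sizeof(word));
    }
    for (; i < len; ++i)          // tail bytes that don't fill a full word
        buffer[i] ^= value;
}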

Related

Efficient equality test for bitstrings with arbitrary offsets

I have more than 1e7 sequences of tokens, where each token can only take one of four possible values.
In order to make this dataset fit into memory, I decided to encode each token in 2 bits, which allows storing 4 tokens per byte instead of just one (as with a char per token / std::string per sequence). I store each sequence in a char array.
For some algorithm, I need to test arbitrary subsequences of two token sequences for exact equality. Each subsequence can have an arbitrary offset. The length is typically between 10 and 30 tokens (random) and is the same for the two subsequences.
My current method is to operate in chunks:
Copy up to 32 tokens (each having 2 bit) from each subsequences into an uint64_t. This is realized in a loop over the tokens that selects the correct char in the array and writes the bits into the correct position of the uint64_t.
Compare the two uint64_t. If they are not equal, return.
Repeat until all tokens in the subsequences have been processed.
#include <climits>
#include <cstdint>

using Block = char;
constexpr int BitsPerToken = 2;
constexpr int TokenPerBlock = sizeof(Block) * CHAR_BIT / BitsPerToken;

Block getTokenFromBlock(Block b, int nt) noexcept
{
    return (b >> (nt * BitsPerToken)) & ((1UL << BitsPerToken) - 1);
}

bool seqEqual(Block const* seqA, int startA, int endA, Block const* seqB, int startB, int endB) noexcept
{
    using CompareBlock = uint64_t;
    constexpr int TokenPerCompareBlock = sizeof(CompareBlock) * CHAR_BIT / BitsPerToken;
    const int len = endA - startA;
    int posA = startA;
    int posB = startB;
    CompareBlock curA = 0;
    CompareBlock curB = 0;
    for (int i = 0; i < len; ++i, ++posA, ++posB)
    {
        const int cmpIdx = i % TokenPerCompareBlock;
        const int blockA = posA / TokenPerBlock;
        const int idxA = posA % TokenPerBlock;
        const int blockB = posB / TokenPerBlock;
        const int idxB = posB % TokenPerBlock;
        if ((i % TokenPerCompareBlock) == 0)
        {
            if (curA != curB)
                return false;
            curA = 0;
            curB = 0;
        }
        curA += CompareBlock(getTokenFromBlock(seqA[blockA], idxA)) << (BitsPerToken * cmpIdx);
        curB += CompareBlock(getTokenFromBlock(seqB[blockB], idxB)) << (BitsPerToken * cmpIdx);
    }
    if (curA != curB)
        return false;
    return true;
}
I figured that this should be quite fast (comparing 32 tokens simultaneously), but it is more than two times slower than using an std::string (with each token stored in a char) and its operator==.
I have looked into std::memcmp, but cannot use it because the subsequence might start somewhere within a byte (at a multiple of 2 bits, though).
Another candidate would be boost::dynamic_bitset, which basically implements the same storage format. However, it does not include equality tests.
How can I achieve fast equality tests using this compressed format?
First of all, this is the kind of computation where the target processor, RAM, compiler and compiler flags can drastically change the results. Unfortunately, this critical information is not provided. Let's assume you use a fairly recent mainstream x86-64 processor, common DDR4-SDRAM, a relatively up-to-date compiler like Clang/GCC, and that optimizations are enabled (i.e. -O3 and possibly -march=native).
Clang and GCC use fast comparison functions for comparing strings: memcmp for GCC 12 and bcmp for Clang 15, respectively. The two functions are highly optimized on most platforms: they typically compare short strings in blocks of 8 bytes (uint64_t) and large strings using SIMD instructions.
Your optimization is good for reducing the memory footprint, but it introduces more computation, and there is a high chance the operation is already compute-bound if the input buffer is in the CPU cache. In addition, the computation is not SIMD-friendly due to the inner loop: the compiler will certainly not generate efficient code because of the bit-wise operations. The thing is, scalar code is slow. In fact, scalar byte-per-byte computations are generally so slow that they are usually far from able to saturate the RAM bandwidth (at least the bandwidth achievable using only 1 core), as opposed to memcmp. For example, a Skylake/Coffee Lake processor at 4 GHz can only read 8 GiB/s from the L1 cache using scalar byte-per-byte code, while an AVX2 SIMD code can read 256 GiB/s. For writes it is half that: 4 GiB/s vs. 128 GiB/s. A 1-channel DDR4-SDRAM @ 3200 MHz can theoretically reach ~24 GiB/s, that is, far more than a byte-per-byte scalar sequential code. The L3 cache has a much bigger bandwidth.
If you want fast code for large sequences, then you need to either help your compiler so it can use SIMD instructions (not so easy in this case), use non-portable SIMD intrinsics, or possibly use a relatively portable SIMD library that generates quite good SIMD code (though low-level platform-dependent intrinsics are more flexible/featureful).
I expect the main bottleneck to come from the "loop over the tokens that selects the correct char in the array and writes the bits into the correct position of the uint64_t". Indeed, this loop will likely generate a dependency chain of instructions (operating on the same uint64_t variable) that cannot be executed efficiently by the processor nor easily optimized by the compiler.
A typical solution would be to read blocks of 8 bytes (using memcpy to do it correctly, and hoping the compiler optimizes it properly). The bytes can be reordered using a bswap instruction on x86-64 processors; this is not needed on big-endian processors. A shift+mask can then be applied so as to compare only the useful part. Here is an (untested) example to show the idea:
if (length >= 16)
{
    uint64_t block1, block2;
    uint64_t prev_block1 = 0, prev_block2 = 0;
    unsigned int shift1 = (start1 % 4) * 2;
    unsigned int shift2 = (start2 % 4) * 2;
    uint64_t mask = 0xFFFFFFFFFFFFFF00ull;
    // Read blocks 7 bytes at a time for the sake of simplicity
    for (size_t i = 0; i < length - 7; i += 7)
    {
        // Safe and cheap on GCC/Clang
        memcpy(&block1, charArray1 + i, 8);
        memcpy(&block2, charArray2 + i, 8);
        // Architecture-dependent: reorder bytes on little-endian processors.
        // There is a fast instruction for that on x86-64 processors: bswap.
        // See: https://stackoverflow.com/questions/36497605
        block1 = reorder_bytes(block1);
        block2 = reorder_bytes(block2);
        block1 = (block1 << shift1) & mask;
        block2 = (block2 << shift2) & mask;
        if (block1 != block2)
            return false;
    }
}
// TODO: compute the remainder part for the last block
This operation can be done using the SSE/AVX instruction sets so as to be faster for large sequences. Note that you can perform a special optimization when shift1 == shift2 (especially when both are equal to 0).
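For instance, here is a hedged sketch (untested; the function name and parameter types are my choice) of that shift1 == shift2 == 0 fast path, where both subsequences start on a byte boundary and length is counted in tokens (4 tokens per byte):
#include <cstring>

// Hypothetical fast path: both subsequences are byte-aligned, so whole bytes can be
// compared with memcmp and only the trailing partial byte needs a mask.
bool equalAligned(const unsigned char* charArray1, const unsigned char* charArray2, int length)
{
    const int fullBytes = length / 4;
    if (fullBytes && std::memcmp(charArray1, charArray2, fullBytes) != 0)
        return false;
    const int rest = length % 4;
    if (rest) {
        const unsigned mask = (1u << (2 * rest)) - 1;   // low 2*rest bits hold the tail tokens
        if (((charArray1[fullBytes] ^ charArray2[fullBytes]) & mask) != 0)
            return false;
    }
    return true;
}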
One should keep in mind that the bit-packing computation is pretty expensive, even using SIMD code. It will certainly not be faster than a memcmp unless the operation is memory-bound, which is unlikely to be the case. For example, a Skylake/Coffee Lake processor can load and compare 2 blocks of 32 bytes (i.e. 32 tokens per block) in only 1 cycle (reciprocal throughput) using the AVX2 SIMD instruction set, while there is no chance each iteration of the above bit-packing loop takes less than 2 cycles to process 7 bytes (i.e. 28 tokens). Using AVX2 to optimize the above code is possible, but the AVX lanes and the byte reordering require several additional instructions, so it will certainly still be slightly slower than a basic, very fast comparison (a few cycles to compare ~120 tokens).
The only use-case where packing can help is when multiple cores are used to do the computation. Indeed, in that case the bit-packing code can scale well because it is likely compute-bound, while the string-based version will quickly be limited by the speed of the RAM since it is likely memory-bound.
If there are only 10 million tokens total, that's 20 Mbit, or 2-3 MB. If you keep shifted versions of them in different arrays, from 2-bit-shifted up to 30-bit-shifted (assuming 4-byte comparisons at once; ignore the 32-bit shift as it just means a different starting position), you can do a direct comparison (std::memcmp) with no shifting involved (fast) after selecting the right array from the offset modulo. But this requires the token sequence to be constant through many function calls (if not the lifetime of the program).
If these tokens are part of a much bigger dataset, you can put a caching layer (one that caches fixed-length chunks and joins them to get the requested sub-sequence for A and B) just before the shifted initialization. Maybe LRU/LFU works fast enough if the token access pattern is cache-friendly. If it's not cache-friendly, then perhaps just reaching the arrays could be the bottleneck, with or without shifting.
If you do the checking per byte instead of per 4 bytes, it requires only 4 arrays instead of 16 and shouldn't add too big a memory requirement on top of the caching.
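A minimal sketch of this per-byte variant (untested; the struct and function names are mine, the layout matches the question's 2-bits-per-token, low-bits-first packing, and non-empty sequences are assumed):
#include <cstddef>
#include <cstring>
#include <vector>

// shifted[k] holds the same token stream with k dummy tokens prepended, so any
// subsequence whose start satisfies (start + k) % 4 == 0 begins on a byte boundary there.
struct ShiftedSeq {
    std::vector<unsigned char> shifted[4];
};

ShiftedSeq makeShifted(const unsigned char* data, std::size_t numBytes)
{
    ShiftedSeq s;
    s.shifted[0].assign(data, data + numBytes);
    for (int k = 1; k < 4; ++k) {
        std::vector<unsigned char>& out = s.shifted[k];
        out.resize(numBytes + 1);                         // one extra byte for the carried-out tokens
        out[0] = (unsigned char)(data[0] << (2 * k));
        for (std::size_t j = 1; j < numBytes; ++j)
            out[j] = (unsigned char)((data[j] << (2 * k)) | (data[j - 1] >> (8 - 2 * k)));
        out[numBytes] = (unsigned char)(data[numBytes - 1] >> (8 - 2 * k));
    }
    return s;
}

// Compare `len` tokens of A starting at token startA with B starting at startB.
bool seqEqualShifted(const ShiftedSeq& A, int startA, const ShiftedSeq& B, int startB, int len)
{
    const int kA = (4 - startA % 4) % 4, kB = (4 - startB % 4) % 4;
    const unsigned char* pa = A.shifted[kA].data() + (startA + kA) / 4;  // byte-aligned start
    const unsigned char* pb = B.shifted[kB].data() + (startB + kB) / 4;
    const int fullBytes = len / 4, rest = len % 4;
    if (fullBytes && std::memcmp(pa, pb, fullBytes) != 0)
        return false;
    if (rest) {
        const unsigned mask = (1u << (2 * rest)) - 1;     // low `rest` tokens of the final byte
        if (((pa[fullBytes] ^ pb[fullBytes]) & mask) != 0)
            return false;
    }
    return true;
}
The trade-off is as described: the four copies cost extra memory and only pay off if the sequences are reused across many comparisons.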
You can also add an XOR result of fixed-length (like 50-100) sub-sequences for every offset as a way of exiting earlier. Again, this requires 4x more memory space. If the XOR results of the first tokens (up to the fixed length) are not equal, then the subsequences are not equal. This would at least reduce the number of comparisons.
Another way is to directly cache f(x,y)->bool, like Python does with its own caching. But this would be much worse than "fixed-length-chunked caching & joining" due to non-reusable parts & a lot of duplication.

Optimal Manipulation of Long Bitwise Structures [duplicate]


Fastest way to downcast an array short to char

I have to process roughly 2000 100-element arrays every second. The arrays come to me as shorts, with the data in the upper bits, and need to be shifted and cast to chars. Is this as efficient as I can get, or is there a faster way to perform this operation? (I have to skip 2 of the values.)
for (int i = 0; i < 48; i++)
{
    a[i] = (char)(b[i] >> 8);
    a[i+48] = (char)(b[i+50] >> 8);
}
Even if shifts and bitwise operations are fast, you can try to process the short array through a char pointer, as others advised in comments. It is allowed per the standard and for common architectures does what is expected - apart from the endianness problem.
So you could try to first determine your endianness:
bool isBigEndian() {
    short i = 1; // sets only lowest order bit
    char *ix = reinterpret_cast<char *>(&i);
    return (*ix == 0); // will be 1 if little endian
}
Your loop now becomes:
int shft = isBigEndian() ? 0 : 1;
char *pb = reinterpret_cast<char *>(b);
for (int i = 0; i < 48; i++)
{
    a[i] = pb[2 * i + shft];
    a[i+48] = pb[2 * (i + 50) + shft];
}
But as always for low level optimisation, this has to be benchmarked with the compiler and compiler options that will be used in production code.
You could put a wrapper class around these arrays, so code that accesses elements of the wrapper in order actually accesses every other byte of the underlying memory.
This will probably defeat auto-vectorization, though. Other than that, having all the code that would read a actually read b and increment its pointers by two instead of one shouldn't change the cost at all.
The two skipped elements are a problem, though. Having your operator[] do if (i>=48) i+=2 might kill this idea. memmove will often be much faster than storing one byte at a time, so you could consider using memmove to make a contiguous array of shorts that you can index even though it seems silly to copy without storing in a better format.
The trick will be to write a wrapper that completely optimizes away to no extra instructions in loops over your arrays. This is possible on x86, where scaled indexing is available in normal effective-addresses in asm instructions, so if the compiler understands what's going on, it can make code that's just as efficient.
Having arrays of shorts does take twice as much memory, so cache effects could matter.
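A rough sketch of such a wrapper (my own illustration, untested; it assumes a little-endian target, where the high byte of each short is stored second):
#include <cstddef>

struct HighByteView {
    const unsigned char *bytes;               // raw storage of the short array
    explicit HighByteView(const short *b)
        : bytes(reinterpret_cast<const unsigned char *>(b)) {}
    unsigned char operator[](std::size_t i) const {
        if (i >= 48) i += 2;                  // skip the two unused shorts, as discussed above
        return bytes[2 * i + 1];              // every other byte: the high byte of b[i]
    }
};
Code that would have read a can index this view over b instead; whether the branch in operator[] defeats auto-vectorization is exactly the concern raised above.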
It all depends on what you need to do with the byte arrays.
If you do need to convert, use SIMD
For x86 targets, you can get a big speedup with SIMD vectors instead of looping one char at a time. For other compile targets you care about, you can write similar special versions. I assume ARM NEON has similar shuffling capability, for example.
When writing a platform-specific version, you also get to make all the endian and unaligned-access assumptions that are true on that platform.
#ifdef __SSE2__ // will be true for all x86-64 builds and most i386 builds
#include <immintrin.h>
static __m128i pack2(const short *p) {
    __m128i lo = _mm_loadu_si128((__m128i*)p);
    __m128i hi = _mm_loadu_si128((__m128i*)(p + 8));
    lo = _mm_srli_epi16(lo, 8); // logical shift, not arithmetic, because we need the high byte to be zero
    hi = _mm_srli_epi16(hi, 8);
    return _mm_packus_epi16(lo, hi); // treats input as signed, saturates to unsigned 0x0 .. 0xff range
}
#endif // SSE2

void conv(char *a, const short *b) {
#ifdef __SSE2__
    for (int i = 0; i < 48; i += 16) {
        __m128i low = pack2(b + i);
        _mm_storeu_si128((__m128i *)(a + i), low);
        __m128i high = pack2(b + i + 50);
        _mm_storeu_si128((__m128i *)(a + i + 48), high);
    }
#else
    /******* Fallback C version *******/
    for (int i = 0; i < 48; i++) {
        a[i] = (char)(b[i] >> 8);
        a[i+48] = (char)(b[i+50] >> 8);
    }
#endif
}
As you can see on the Godbolt Compiler Explorer, gcc fully unrolls the loop since it's only a few iterations when storing 16B at a time.
This should perform ok, but on pre-Skylake will bottleneck on shifting both vectors of shorts before the store. Haswell can only sustain one psrli per clock. (Skylake can sustain one per 0.5c when the shift-count is an immediate. See Agner Fog's guide and insn tables, links at the x86 tag wiki.)
You might get better results from loading from (__m128i*)(1 + (char*)p) so the bytes we want are already in the low half of each 16-bit element. We'd still have to mask off the high half of each element with _mm_and_si128 instead of shifting, but PAND can run on any vector execution port, so it has three-per-clock throughput.
More importantly, with AVX it can be combined with an unaligned load, e.g. vpand xmm0, xmm5, [rsi], where xmm5 is a mask of _mm_set1_epi16(0x00ff) and [rsi] holds 2*i + 1 + (char*)b. Fused-domain uop throughput is probably going to be an issue, as is common for code with a lot of loads/stores as well as computation.
Unaligned accesses are slightly slower than aligned accesses, but at least half your vector accesses will be unaligned anyway (since skipping two shorts means skipping 4B). On Intel SnB-family CPUs, I don't think it's slower to have loads that are split across a cache-line boundary in a 15:1 split compared to a 12:4 split. (The no-split case is definitely faster, though.) If b is 16B-aligned, then it'll be worth testing the mask version against the shift version.
I didn't write up complete code for this version, because you'll end up reading one byte past the end of b unless you take special precautions. This is fine if you make sure b has padding of some sort so it doesn't go right to the end of a memory page.
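Here is a hedged sketch of that mask variant (untested; the name pack2_mask is mine). As noted, the second load reads one byte past the end of each 16-short group, so the source needs a byte of padding:
#include <emmintrin.h>

// Load one byte past p so the wanted high bytes land in the low byte of each 16-bit
// element, then mask with 0x00ff instead of shifting.
static __m128i pack2_mask(const short *p) {
    const __m128i mask = _mm_set1_epi16(0x00ff);
    __m128i lo = _mm_loadu_si128((const __m128i *)(1 + (const char *)p));
    __m128i hi = _mm_loadu_si128((const __m128i *)(1 + (const char *)(p + 8)));
    lo = _mm_and_si128(lo, mask);
    hi = _mm_and_si128(hi, mask);
    return _mm_packus_epi16(lo, hi);   // values are already 0..255, so the pack is exact
}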
AVX2
With AVX2, vpackuswb ymm operates in two separate lanes. IDK if there's anything to gain from doing the load and mask (or shift) on 256b vectors and then using a vextracti128 and 128b pack on the two halves of the 256b vector.
Or maybe do a 256b pack between two vectors and then a vpermq (_mm256_permute4x64_epi64) to sort things out:
__m256i lo = _mm256_loadu_si256(b..);           // { b[15..8] | b[7..0] }
__m256i hi = ...;                               // { b[31..24] | b[23..16] }
// mask or shift
__m256i packed = _mm256_packus_epi16(lo, hi);   // [ a31..24 a15..8 | a23..16 a7..0 ]
packed = _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
Of course, use any portable optimizations you can in the C version, e.g. Serge Ballesta's suggestion of just copying the desired bytes after figuring out their location from the endianness of the machine. (Preferably at compile time by checking GNU C's __BYTE_ORDER__ macro.)
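A minimal sketch of that compile-time check (GNU C / Clang predefined macros; the constant name is mine, for illustration):
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
constexpr int high_byte_offset = 1;   // the high byte of a short is stored second
#else
constexpr int high_byte_offset = 0;
#endif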

What's the fastest way to pack 32 0/1 values into the bits of a single 32-bit variable?

I'm working on an x86 or x86_64 machine. I have an array unsigned int a[32] all of whose elements have value either 0 or 1. I want to set the single variable unsigned int b so that (b >> i) & 1 == a[i] will hold for all 32 elements of a. I'm working with GCC on Linux (shouldn't matter much I guess).
What's the fastest way to do this in C?
The fastest way on recent x86 processors is probably to make use of the MOVMSKB family of instructions which extract the MSBs of a SIMD word and pack them into a normal integer register.
I fear SIMD intrinsics are not really my thing but something along these lines ought to work if you've got an AVX2 equipped processor:
#include <immintrin.h>
#include <cstdint>

uint32_t bitpack(const bool array[32]) {
    __m256i tmp = _mm256_loadu_si256((const __m256i *) array);
    tmp = _mm256_cmpgt_epi8(tmp, _mm256_setzero_si256());
    return _mm256_movemask_epi8(tmp);
}
Assuming sizeof(bool) == 1. For older SSE2 systems you will have to string together a pair of 128-bit operations instead. Aligning the array on a 32-byte boundary should save another cycle or so.
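A hedged sketch of that SSE2 fallback (untested; same sizeof(bool) == 1 assumption, and the function name is mine):
#include <emmintrin.h>
#include <cstdint>

uint32_t bitpack_sse2(const bool array[32]) {
    __m128i lo = _mm_loadu_si128((const __m128i *) array);
    __m128i hi = _mm_loadu_si128((const __m128i *) (array + 16));
    lo = _mm_cmpgt_epi8(lo, _mm_setzero_si128());        // 0xFF where the bool is nonzero
    hi = _mm_cmpgt_epi8(hi, _mm_setzero_si128());
    uint32_t lo_bits = (uint32_t)_mm_movemask_epi8(lo);  // 16 bits from the first half
    uint32_t hi_bits = (uint32_t)_mm_movemask_epi8(hi);
    return lo_bits | (hi_bits << 16);
}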
If sizeof(bool) == 1 then you can pack 8 bools at a time into 8 bits (more with 128-bit multiplications) using the technique discussed here, on a computer with fast multiplication, like this:
inline int pack8b(bool* a)
{
    uint64_t t = *((uint64_t*)a);
    return (0x8040201008040201*t >> 56) & 0xFF;
}

int pack32b(bool* a)
{
    return (pack8b(a + 0) << 24) | (pack8b(a + 8) << 16) |
           (pack8b(a + 16) << 8) | (pack8b(a + 24) << 0);
}
Explanation:
Suppose the bools a[0] to a[7] have their least significant bits named a-h respectively. Treating those 8 consecutive bools as one 64-bit word and loading them, we'll get the bits in reversed order on a little-endian machine. Now we'll do a multiplication (here dots are zero bits):
| a7 || a6 || a5 || a4 || a3 || a2 || a1 || a0 |
.......h.......g.......f.......e.......d.......c.......b.......a
× 1000000001000000001000000001000000001000000001000000001000000001
────────────────────────────────────────────────────────────────
↑......h.↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
↑....f...↑...e....↑..d.....↑.c......↑b.......a
+ ↑...e....↑..d.....↑.c......↑b.......a
↑..d.....↑.c......↑b.......a
↑.c......↑b.......a
↑b.......a
a
────────────────────────────────────────────────────────────────
= abcdefghxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The arrows are added so it's easier to see the position of the set bits in the magic number. At this point the 8 least significant bits have been put in the top byte; we just need to mask the remaining bits out.
So by using the magic number 0b1000000001000000001000000001000000001000000001000000001000000001, or 0x8040201008040201, we have the above code.
Of course you need to make sure that the bool array is correctly 8-byte aligned. You can also unroll the code and optimize it, for example shifting only once instead of shifting by 56 bits in each pack8b call.
Sorry, I overlooked the question, saw doynax's bool array, and misread "32 0/1 values" as meaning 32 bools. Of course the same technique can also be used to pack multiple uint32_t or uint16_t values (or other distributions of bits) at the same time, but it's a lot less efficient than packing bytes.
On newer x86 CPUs with BMI2 the PEXT instruction can be used. The pack8b function above can be replaced with
_pext_u64(*((uint64_t*)a), 0x0101010101010101ULL);
And to pack 2 uint32_t as the question requires use
_pext_u64(*((uint64_t*)a), (1ULL << 32) | 1ULL);
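Putting that together for the unsigned int a[32] from the question, a hedged sketch (untested; the helper name is mine, it requires BMI2, e.g. compile with -mbmi2, and it assumes a little-endian target and that every a[i] is exactly 0 or 1):
#include <immintrin.h>
#include <cstdint>
#include <cstring>

static inline uint32_t pack32_pext(const unsigned int a[32])
{
    uint32_t b = 0;
    for (int i = 0; i < 32; i += 2) {
        uint64_t two;
        std::memcpy(&two, &a[i], sizeof(two));   // a[i] in the low half, a[i+1] in the high half
        // keep bit 0 of each 32-bit half: bit i of b becomes a[i], bit i+1 becomes a[i+1]
        b |= (uint32_t)(_pext_u64(two, (1ULL << 32) | 1ULL) << i);
    }
    return b;
}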
Other answers contain an obvious loop implementation.
Here's a first variant:
unsigned int result = 0;
for (unsigned i = 0; i < 32; ++i)
    result = (result << 1) + a[i];
On modern x86 CPUs, I think shifts of any distance in a register take constant time, and this solution won't be better. Your CPU might not be so nice; this code minimizes the cost of long-distance shifts; it does 32 1-bit shifts, which every CPU can do (you can always add result to itself to get the same effect). The obvious loop implementation shown by others does about 900 (sum over 32) 1-bit shifts, by virtue of shifting a distance equal to the loop index. (See @Jongware's measurements of differences in comments; apparently long shifts on x86 are not unit time.)
Let us try something more radical.
Assume you can pack m booleans into an int somehow (trivially you can do this for m==1), and that you have two instance variables i1 and i2 containing such m packed bits.
Then the following code packs m*2 booleans into an int:
(i1 << m) + i2
Using this we can pack 2^n bits as follows:
unsigned int a2[16], a4[8], a8[4], a16[2], a32[1]; // each "aN" will hold N bits of the answer
a2[0]=(a1[0]<<1)+a1[1]; // the original bits are a1[k]; can be scalar variables or ints
a2[1]=(a1[2]<<1)+a1[3]; // yes, you can use "|" instead of "+"
...
a2[15]=(a1[30]<<1)+a1[31];
a4[0]=(a2[0]<<2)+a2[1];
a4[1]=(a2[2]<<2)+a2[3];
...
a4[7]=(a2[14]<<2)+a2[15];
a8[0]=(a4[0]<<4)+a4[1];
a8[1]=(a4[2]<<4)+a4[3];
a8[2]=(a4[4]<<4)+a4[5];
a8[3]=(a4[6]<<4)+a4[7];
a16[0]=(a8[0]<<8)+a8[1];
a16[1]=(a8[2]<<8)+a8[3];
a32[0]=(a16[0]<<16)+a16[1];
Assuming our friendly compiler resolves an[k] into a (scalar) direct memory access (if not, you can simply replace the variable an[k] with an_k), the above code does (abstractly) 63 fetches, 31 writes, 31 shifts and 31 adds. (There's an obvious extension to 64 bits).
On modern x86 CPUs, I think shifts of any distance in a register take constant time. If not, this code minimizes the cost of long-distance shifts; it in effect does 64 1-bit shifts.
On an x64 machine, other than the fetches of the original booleans a1[k], I'd expect all the rest of the scalars to be schedulable by the compiler to fit in the registers, thus 32 memory fetches, 31 shifts and 31 adds. It's pretty hard to avoid the fetches (if the original booleans are scattered around), and the shifts/adds match the obvious simple loop. But there is no loop, so we avoid 32 increment/compare/index operations.
If the starting booleans are really in an array, with each bit occupying the bottom bit of an otherwise zeroed byte:
bool a1[32];
then we can abuse our knowledge of memory layout to fetch several at a time:
a4[0]=((unsigned int*)a1)[0]; // picks up 4 bools in one fetch
a4[1]=((unsigned int*)a1)[1];
...
a4[7]=((unsigned int*)a1)[7];
a8[0]=(a4[0]<<1)+a4[1];
a8[1]=(a4[2]<<1)+a4[3];
a8[2]=(a4[4]<<1)+a4[5];
a8[3]=(a4[6]<<1)+a4[7];
a16[0]=(a8[0]<<2)+a8[1];
a16[1]=(a8[2]<<2)+a8[3];
a32[0]=(a16[0]<<4)+a16[1];
Here our cost is 8 fetches of (sets of 4) booleans, 7 shifts and 7 adds. Again, no loop overhead. (Again there is an obvious generalization to 64 bits).
To get faster than this, you probably have to drop into assembler and use some of the many wonderful and weird instructions available there (the vector registers probably have scatter/gather ops that might work nicely).
As always, these solutions need to be performance tested.
I would probably go for this:
#include <cstdio>

unsigned a[32] =
{
    1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,
    1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1
};

int main()
{
    unsigned b = 0;
    for (unsigned i = 0; i < sizeof(a) / sizeof(*a); ++i)
        b |= a[i] << i;
    printf("b: %u\n", b);
}
Compiler optimization may well unroll that but just in case you can always try:
int main()
{
    unsigned b = 0;
    b |= a[0];
    b |= a[1] << 1;
    b |= a[2] << 2;
    b |= a[3] << 3;
    // ... etc
    b |= a[31] << 31;
    printf("b: %u\n", b);
}
To determine what the fastest way is, time all of the various suggestions. Here is one that may well end up as "the" fastest (using standard C, no processor-dependent SSE or the like):
unsigned int bits[32][2] = {
{0,0x80000000},{0,0x40000000},{0,0x20000000},{0,0x10000000},
{0,0x8000000},{0,0x4000000},{0,0x2000000},{0,0x1000000},
{0,0x800000},{0,0x400000},{0,0x200000},{0,0x100000},
{0,0x80000},{0,0x40000},{0,0x20000},{0,0x10000},
{0,0x8000},{0,0x4000},{0,0x2000},{0,0x1000},
{0,0x800},{0,0x400},{0,0x200},{0,0x100},
{0,0x80},{0,0x40},{0,0x20},{0,0x10},
{0,8},{0,4},{0,2},{0,1}
};
unsigned int b = 0;
for (int i = 0; i < 32; i++)
    b |= bits[i][a[i]];
The first value in the array is to be the leftmost bit: the highest possible value.
Testing the proof-of-concept with some rough timings shows this is indeed not orders of magnitude better than the straightforward loop with b |= (a[i]<<(31-i)):
Ira 3618 ticks
naive, unrolled 5620 ticks
Ira, 1-shifted 10044 ticks
Galik 10265 ticks
Jongware, using adds 12536 ticks
Jongware 12682 ticks
naive 13373 ticks
(Relative timings, with the same compiler options.)
(The 'adds' routine is mine, with indexing replaced by a pointer and an explicit add for both indexed arrays. It is 10% slower, meaning my compiler is efficiently optimizing indexed access. Good to know.)
unsigned b = 0;
for (int i = 31; i >= 0; --i) {
    b <<= 1;
    b |= a[i];
}
Your problem is a good opportunity to use -->, also called the downto operator:
unsigned int a[32];
unsigned int b = 0;
for (unsigned int i = 32; i --> 0;) {
    b += b + a[i];
}
The advantage of using --> is it works with both signed and unsigned loop index types.
This approach is portable and readable. It might not produce the fastest code, but clang does unroll the loop and produces decent performance; see https://godbolt.org/g/6xgwLJ

Vectorized extraction of a specific pattern of shorts from an array, and also insertion into a new array

I have an array of shorts where I want to grab half of the values and put them in a new array that is half the size. I want to grab particular values in this sort of pattern, where each block is 128 bits (8 shorts). This is the only pattern I will use, it doesn't need to be "any generic pattern"!
The values in white are discarded. My array sizes will always be a power of 2. Here's the vague idea of it, unvectorized:
unsigned short size = 1 << 8;
unsigned short* data = new unsigned short[size];
...
unsigned short* newdata = new unsigned short[size >>= 1];
unsigned int* uintdata = (unsigned int*) data;
unsigned int* uintnewdata = (unsigned int*) newdata;
for (unsigned short uintsize = size >> 1, i = 0; i < uintsize; ++i)
{
    uintnewdata[i] = (uintdata[i * 2] & 0xFFFF0000) | (uintdata[(i * 2) + 1] & 0x0000FFFF);
}
I started out with something like this:
static const __m128i startmask128 = _mm_setr_epi32(0xFFFF0000, 0x00000000, 0xFFFF0000, 0x00000000);
static const __m128i endmask128 = _mm_setr_epi32(0x00000000, 0x0000FFFF, 0x00000000, 0x0000FFFF);
__m128i* data128 = (__m128i*) data;
__m128i* newdata128 = (__m128i*) newdata;
and I can iteratively perform _mm_and_si128 with the masks to get the values I'm looking for, combine with _mm_or_si128, and put the results in newdata128[i]. However, I don't know how to "compress" things together and remove the values in white. And it seems if I could do that, I wouldn't need the masks at all.
How can that be done?
Anyway, eventually I will also want to do the opposite of this operation as well, and create a new array of twice the size and spread out current values within it.
I will also have new values to insert in the white blocks, which I would have to compute with each pair of shorts in the original data, iteratively. This computation would not be vectorizable, but the insertion of the resulting values should be. How could I "spread out" my current values into the new array, and what would be the best way to insert my computed values? Should I compute them all for each 128-bit iteration and put them into their own temp block (64 bit? 128 bit?), then do something to insert in bulk? Or should they be emplaced directly into my target __m128i, as it seems the cost should be equivalent to putting in a temp? If so, how could that be done without messing up my other values?
I would prefer to use SSE2 operations at most for this.
Here's an outline you can try:
Use the interleave instructions (_mm_unpackhi/lo_epi16) with a register containing zero to "spread out" your 16-bit values. Now you'll have two registers looking like B_R_B_R_.
Shift right creating _B_R_B_R
AND the R's out of the first version B___B___
AND the B's out of the second version ___R___R
OR together B__RB__R
In the other direction use _mm_packs_epi32 in the end after setting it up with shift/and/or.
Each direction should be 10 SSE instructions (not counting the constants setup, zero and the AND masks, and the load/store).
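For the compaction direction, here is a hedged SSE2 sketch (untested, my own arrangement): it reuses the masks from the question, but instead of _mm_packs_epi32 (whose signed saturation would corrupt arbitrary data) it merges the selected halves with a byte shift, a shuffle and a 64-bit unpack:
#include <emmintrin.h>

// Compact two 128-bit blocks (16 shorts) into one (8 shorts), matching the scalar code
// above: each output 32-bit word keeps the high short of an even input word and the
// low short of the following odd input word.
static __m128i compact_pair(__m128i in0, __m128i in1)
{
    const __m128i startmask128 = _mm_setr_epi32((int)0xFFFF0000, 0, (int)0xFFFF0000, 0);
    const __m128i endmask128   = _mm_setr_epi32(0, 0x0000FFFF, 0, 0x0000FFFF);
    // Per register: keep {hi0, 0, hi2, 0} and {0, lo1, 0, lo3}, then slide the low
    // halves down by 4 bytes and OR, leaving the two merged words in lanes 0 and 2.
    __m128i c0 = _mm_or_si128(_mm_and_si128(in0, startmask128),
                              _mm_srli_si128(_mm_and_si128(in0, endmask128), 4));
    __m128i c1 = _mm_or_si128(_mm_and_si128(in1, startmask128),
                              _mm_srli_si128(_mm_and_si128(in1, endmask128), 4));
    // Gather lanes 0 and 2 into the low 64 bits, then glue the two halves together.
    c0 = _mm_shuffle_epi32(c0, _MM_SHUFFLE(3, 1, 2, 0));
    c1 = _mm_shuffle_epi32(c1, _MM_SHUFFLE(3, 1, 2, 0));
    return _mm_unpacklo_epi64(c0, c1);
}
A loop would load data128[2*i] and data128[2*i + 1], call compact_pair, and store the result to newdata128[i]. For the expansion direction the outline above applies: _mm_unpacklo_epi16/_mm_unpackhi_epi16 against a zero register spread 8 shorts across two registers, leaving gaps for the computed values.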