AVX alignment in array

AVX alignment in array - c++

I'm using MSVC12 (Visual Studio 2013 Express) and I try to implemenent a fast multiplication of 8*8 float values. The problem is the alignment: The vector has actually 9*n values, but I always just need the first 8, so e.g. for n=0 the alignment of 32 bytes is guaranteed (when I use _mm_malloc), for n=1 the "first" value is aligned at 4*9 = 36 bytes.
for(unsigned i = 0; i < n; i++) {
float *coeff_set = (float *)_mm_malloc(909 * 100 *sizeof(float), 32);
// this works for n=0, not n=1, n=2, ...
__m256 coefficients = _mm256_load_ps(&coeff_set[9 * i]);
__m256 result = _mm256_mul_ps(coefficients, coefficients);
...
}
Is there any possibility to solve this? I would like to keep the structure of my data, but if not possible, I would change it. One solution I found was to copy the 8 floats first in an aligned array, and then load it, but the performance-loss is way too high then.

You have two choices:
Pad each set of coefficients to 16 values to maintain alignment
Use the _mm256_loadu_ps intrinsic for unaligned accesses
The first choice is more speed-efficient, while the second is more space-efficient.

Related

Vectorized function to count numbers in an array when a number is a specified power

I am attempting to vectorize this fairly expensive function (Scaler Now working!):
template<typename N, typename POW>
inline constexpr bool isPower(const N n, const POW p) noexcept
{
double x = std::log(static_cast<double>(n)) / std::log(static_cast<double>(p));
return (x - std::trunc(x)) < 0.000001;
}//End of isPower
Here's what I have so far (for 32-bit int only):
template<typename RETURN_T>
inline RETURN_T count_powers_of(const std::vector<int32_t>& arr, const int32_t power)
{
RETURN_T cnt = 0;
const __m256 _MAGIC = _mm256_set1_ps(0.000001f);
const __m256 _POWER_D = _mm256_set1_ps(static_cast<float>(para));
const __m256 LOG_OF_POWER = _mm256_log_ps(_POWER_D);
__m256i _count = _mm256_setzero_si256();
__m256i _N_INT = _mm256_setzero_si256();
__m256 _N_DBL = _mm256_setzero_ps();
__m256 LOG_OF_N = _mm256_setzero_ps();
__m256 DIVIDE_LOG = _mm256_setzero_ps();
__m256 TRUNCATED = _mm256_setzero_ps();
__m256 CMP_MASK = _mm256_setzero_ps();
for (size_t i = 0uz; (i + 8uz) < end; i += 8uz)
{
//Set Values
_N_INT = _mm256_load_si256((__m256i*) &arr[i]);
_N_DBL = _mm256_cvtepi32_ps(_N_INT);
LOG_OF_N = _mm256_log_ps(_N_DBL);
DIVIDE_LOG = _mm256_div_ps(LOG_OF_N, LOG_OF_POWER);
TRUNCATED = _mm256_sub_ps(DIVIDE_LOG, _mm256_trunc_ps(DIVIDE_LOG));
CMP_MASK = _mm256_cmp_ps(TRUNCATED, _MAGIC, _CMP_LT_OQ);
_count = _mm256_sub_epi32(_count, _mm256_castps_si256(CMP_MASK));
}//End for
cnt = static_cast<RETURN_T>(util::_mm256_sum_epi32(_count));
}//End of count_powers_of
The scaler version runs in about 14.1 seconds.
The scaler version called from std::count_if with par_unseq runs in 4.5 seconds.
The vectorized version runs in just 155 milliseconds but produces the wrong result. Albeit vastly closer now.
Testing:
int64_t count = 0;
for (size_t i = 0; i < vec.size(); ++i)
{
if (isPower(vec[i], 4))
{
++count;
}//End if
}//End for
std::cout << "Counted " << count << " powers of 4.\n";//produces 4,996,215 powers of 4 in a vector of 1 billion 32-bit ints consisting of a uniform distribution of 0 to 1000
std::cout << "Counted " << count_powers_of<int32_t>(vec, 4) << " powers of 4.\n";//produces 4,996,865 powers of 4 on the same array
This new vastly simplified code often produces results that are either slightly off the correct number of powers found (usually higher). I think the problem is my reinterpret cast from __m256 to _m256i but when I try use a conversation (with floor) instead I get a number that's way off (in the billions again).
It could also be this sum function (based off of code by #PeterCordes ):
inline uint32_t _mm_sum_epi32(__m128i& x)
{
__m128i hi64 = _mm_unpackhi_epi64(x, x);
__m128i sum64 = _mm_add_epi32(hi64, x);
__m128i hi32 = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
__m128i sum32 = _mm_add_epi32(sum64, hi32);
return _mm_cvtsi128_si32(sum32);
}
inline uint32_t _mm256_sum_epi32(__m256i& v)
{
__m128i sum128 = _mm_add_epi32(
_mm256_castsi256_si128(v),
_mm256_extracti128_si256(v, 1));
return _mm_sum_epi32(sum128);
}
I know this has got to be a floating-point precision/comparison issue; Is there a better way to approach this?
Thanks for all your insights and suggestions thus far.

A more sensible unit-test would be to non-random: Check all powers in a loop to make sure they're all true, like x *= base;, and count how many powers there are <= n. Then check all numbers from 0..n in a loop, once each to verify the right total. If both those checks succeed, that means it returned false in all the cases it should have, otherwise the count would be wrong.
Re: the original version:
This seems to depend on there being no floating-point rounding error. You do d == (N)d which (if N is an integral type) checks that the ratio of two logs is an exact integer; even 1 bit in the mantissa will make it unequal. Hardly surprising that a different log implementation would give different results, if one has different rounding error.
Except your scalar code at least is even more broken because it takes d = floor(log ratio) so it's already always an exact integer.
I just tried your scalar version for a testcase like return isPower(5, 4) to ask if 5 is a power of 4. It returns true: https://godbolt.org/z/aMT94ro6o . So yeah, your code is super broken, and is in fact only checking that n>0 or something. That would explain why 999 of 1000 of your "random" inputs from 0..999 were counted as powers of 4, which is obviously super broken.
I think it's impossible to achieve correctness with your FP log ratio idea: FP rounding error means you can't expect exact equality, but allowing a range would probably let in non-exact powers.
You might want to special-case integral N, power-of-2 pow. That can go vastly vaster by checking that n has a single bit set (n & (n-1) == 0) and that it's at a valid position. (e.g. for pow=4, n & 0b...10101010 != 0). You can construct the constant by multiplying and adding until overflow or something. Or 32/pow times? Anyway, one psubd/pand/pcmpeqd, pand/pcmpeqd, and pand/psubd per 8 elements, with maybe some room to optimize that further.
Otherwise, in the general case, you can brute-force check 32-bit integers one at a time against the 32 or fewer possible powers that fit in an int32_t. e.g. broadcast-load, 4x vpcmpeqd / vpsubd into multiple accumulators. (The smallest possible base, 2, can have exponents up to 2^31` and still fit in an unsigned int). log_3(2^31) is 19, so you'd only need three AVX2 vectors of powers. Or log_4(2^31) is 15.5 so you'd only need 2 vectors to hold every non-overflowing power.
That only handles 1 input element per vector instead of 4 doubles, but it's probably faster than your current FP attempt, as well as fixing the correctness problems. I could see that running more than 4x the throughput per iteration of what you're doing now, or even 8x, so it should be good for speed. And of course has the advantage that correctness is possible!!
Speed gets even better for bases of 4 or greater, only 2x compare/sub per input element, or 1x for bases of 16 or greater. (<= 8 elements to compare against can fit in one vector).
Implementation mistakes in the attempt to vectorize this probably-unfixable algorithm:
_mm256_rem_epi32 is slow library function, but you're using it with a constant divisor of 2! Integer mod 2 is just n & 1 for non-negative. Or if you need to handle negative remainders, you can use the tricks compilers use to implement int % 2: https://godbolt.org/z/b89eWqEzK where it shifts down the sign bit as a correction to do signed division.
Updated version using (x - std::trunc(x)) < 0.000001;
This might work, especially if you limit it to small n. I'd worry that with large n, the difference between an exact power and off-by-1 would be a small ratio. (I haven't really looked at the details, though.)
Your vectorization with __m256 vectors of single-precision float is doomed for large n, but could be ok for small n: float32 can't represent every int32_t, so large odd integers (above 2^24) get rounded to multiples of 2, or multiples of 4 above 2^25, etc.
float has less relative precision in general, so it might not have enough to spare for this algorithm. Or maybe there's something that could be fixed, IDK, I haven't looked closely since the update.
I'd still recommend trying a simple compare-for-equality against all possible powers in the range, broadcast-loading each element. That will definitely work exactly, and if it's as fast then there's no need to try to fix this version using FP logs.
__m256 _N_DBL = _mm256_setzero_ps(); is a confusing name; it's a vector of float, not double. (And it's not part of a standard library header so it shouldn't be using a leading underscore.)
Also, there's zero point initializing it with zero there, since it gets written unconditionally inside the loop. In fact it's only ever used inside the loop, so it could just be declared at that scope, when you're ready to give it a value. Only declare variables in outer scopes if you need them after a loop.

How can I convert a vector of float to short int using avx instructions?

Basically how can I write the equivalent of this with AVX2 intrinsics? We assume here that result_in_float is of type __m256, while result is of type short int* or short int[8].
for(i = 0; i < 8; i++)
result[i] = (short int)result_in_float[i];
I know that floats can be converted to 32 bit integers using the __m256i _mm256_cvtps_epi32(__m256 m1) intrinsic, but have no idea how to convert these 32 bit integers further to 16 bit integers. And I don't want just that but also to store those values (in the form of 16 bit integers) to the memory, and I want to do that all using vector instructions.
Searching around the internet, I found an intrinsic by the name of_mm256_mask_storeu_epi16, but I'm not really sure if that would do the trick, as I couldn't find an example of its usage.

_mm256_cvtps_epi32 is a good first step, the conversion to a packed vector of shorts is a bit annoying, requiring a cross-slice shuffle (so it's good that it's not in a dependency chain here).
Since the values can be assumed to be in the right range (as per the comment), we can use _mm256_packs_epi32 instead of _mm256_shuffle_epi8 to do the conversion, either way it's a 1-cycle instruction on port 5 but using _mm256_packs_epi32 avoids having to get a shuffle mask from somewhere.
So to put it together (not tested)
__m256i tmp = _mm256_cvtps_epi32(result_in_float);
tmp = _mm256_packs_epi32(tmp, _mm256_setzero_si256());
tmp = _mm256_permute4x64_epi64(tmp, 0xD8);
__m128i res = _mm256_castsi256_si128(tmp);
// _mm_store_si128 that
The last step (cast) is free, it just changes the type.
If you had two vectors of floats to convert, you could re-use most of the instructions, eg: (not tested either)
__m256i tmp1 = _mm256_cvtps_epi32(result_in_float1);
__m256i tmp2 = _mm256_cvtps_epi32(result_in_float2);
tmp1 = _mm256_packs_epi32(tmp1, tmp2);
tmp1 = _mm256_permute4x64_epi64(tmp1, 0xD8);
// _mm256_store_si256 this

Scaling byte pixel values (y=ax+b) with SSE2 (as floats)?

I want to calculate y = ax + b, where x and y is a pixel value [i.e, byte with value range is 0~255], while a and b is a float
Since I need to apply this formula for each pixel in image, in addition, a and b is different for different pixel. Direct calculation in C++ is slow, so I am kind of interest to know the sse2 instruction in c++..
After searching, I find that the multiplication and addition in float with sse2 is just as _mm_mul_ps and _mm_add_ps. But in the first place I need to convert the x in byte to float (4 byte).
The question is, after I load the data from byte-data source (_mm_load_si128), how can I convert the data from byte to float?

a and b are different for each pixel? That's going to make it difficult to vectorize, unless there's a pattern or you can generate them in vectors.
Is there any way you can efficiently generate a and b in vectors, either as fixed-point or floating point? If not, inserting 4 FP values, or 8 16bit integers, might be worse than just scalar ops.
Fixed point
If a and b can be reused at all, or generated with fixed-point in the first place, this might be a good use-case for fixed-point math. (i.e. integers that represent value * 2^scale). SSE/AVX don't have a 8b*8b->16b multiply; the smallest elements are words, so you have to unpack bytes to words, but not all the way to 32bit. This means you can process twice as much data per instruction.
There's a _mm_maddubs_epi16 instruction which might be useful if b and a change infrequently enough, or you can easily generate a vector with alternating a2^4 and b2^1 bytes. Apparently it's really handy for bilinear interpolation, but it still gets the job done for us with minimal shuffling, if we can prepare an a and b vector.
float a, b;
const int logascale = 4, logbscale=1;
const int ascale = 1<<logascale; // fixed point scale for a: 2^4
const int bscale = 1<<logbscale; // fixed point scale for b: 2^1
const __m128i brescale = _mm_set1_epi8(1<<(logascale-logbscale)); // re-scale b to match a in the 16bit temporary result
for (i=0 ; i<n; i+=16) {
//__m128i avec = get_scaled_a(i);
//__m128i bvec = get_scaled_b(i);
//__m128i ab_lo = _mm_unpacklo_epi8(avec, bvec);
//__m128i ab_hi = _mm_unpackhi_epi8(avec, bvec);
__m128i abvec = _mm_set1_epi16( ((int8_t)(bscale*b) << 8) | (int8_t)(ascale*a) ); // integer promotion rules might do sign-extension in the wrong place here, so check this if you actually write it this way.
__m128i block = _mm_load_si128(&buf[i]); // call this { v[0] .. v[15] }
__m128i lo = _mm_unpacklo_epi8(block, brescale); // {v[0], 8, v[1], 8, ...}
__m128i hi = _mm_unpackhi_epi8(block, brescale); // {v[8], 8, v[9], 8, ...
lo = _mm_maddubs_epi16(lo, abvec); // first arg is unsigned bytes, 2nd arg is signed bytes
hi = _mm_maddubs_epi16(hi, abvec);
// lo = { v[0]*(2^4*a) + 8*(2^1*b), ... }
lo = _mm_srli_epi16(lo, logascale); // truncate from scaled fixed-point to integer
hi = _mm_srli_epi16(hi, logascale);
// and re-pack. Logical, not arithmetic right shift means sign bits can't be set
block = _mm_packuswb(lo, hi);
_mm_store_si128(&buf[i], block);
}
// then a scalar cleanup loop
2^4 is an arbitrary choice. It leaves 3 non-sign bits for the integer part of a, and 4 fraction bits. So it effectively rounds a to the nearest 16th, and overflows if it has a magnitude greater than 8 and 15/16ths. 2^6 would give more fractional bits, and allow a from -2 to +1 and 63/64ths.
Since b is being added, not multiplied, its useful range is much larger, and fractional part much less useful. To represent it in 8 bits, rounding it to the nearest half still keeps a little bit of fractional information, but allows it to be [-64 : 63.5] without overflowing.
For more precision, 16b fixed-point is a good choice. You can scale a and b up by 2^7 or something, to have 7b of fractional precision and still allow the integer part to be [-256 .. 255]. There's no multiply-and-add instruction for this case, so you'd have to do that separately. Good options for doing the multiply include:
_mm_mulhi_epu16: unsigned 16b*16b->high16 (bits [31:16]). Useful if a can't be negative
_mm_mulhi_epi16: signed 16b*16b->high16 (bits [31:16]).
_mm_mulhrs_epi16: signed 16b*16b->bits [30:15] of the 32b temporary, with rounding. With a good choice of scaling factor for a, this should be nicer. As I understand it, SSSE3 introduced this instruction for exactly this kind of use.
_mm_mullo_epi16: signed 16b*16b->low16 (bits [15:0]). This only allows 8 significant bits for a before the low16 result overflows, so I think all you gain over the _mm_maddubs_epi16 8bit solution is more precision for b.
To use these, you'd get scaled 16b vectors of a and b values, then:
unpack your bytes with zero (or pmovzx byte->word), to get signed words still in the [0..255] range
left shift the words by 7.
multiply by your a vector of 16b words, taking the upper half of each 16*16->32 result. (e.g. mul
right shift here if you wanted different scales for a and b, to get more fractional precision for a
add b to that.
right shift to do the final truncation back from fixed point to [0..255].
With a good choice of fixed-point scale, this should be able to handle a wider range of a and b, as well as more fractional precision, than 8bit fixed point.
If you don't left-shift your bytes after unpacking them to words, a has to be full-range just to get 8bits set in the high16 of the result. This would mean a very limited range of a that you could support without truncating your temporary to less than 8 bits during the multiply. Even _mm_mulhrs_epi16 doesn't leave much room, since it starts at bit 30.
expand bytes to floats
If you can't efficiently generate fixed-point a and b values for every pixel, it may be best to convert your pixels to floats. This takes more unpacking/repacking, so latency and throughput are worse. It's worth looking into generating a and b with fixed point.
For packed-float to work, you still have to efficiently build a vector of a values for 4 adjacent pixels.
This is a good use-case for pmovzx (SSE4.1), because it can go directly from 8b elements to 32b. The other options are SSE2 punpck[l/h]bw/punpck[l/h]wd with multiple steps, or SSSE3 pshufb to emulate pmovzx. (You can do one 16B load and shuffle it 4 different ways to unpack it to four vectors of 32b ints.)
char *buf;
// const __m128i zero = _mm_setzero_si128();
for (i=0 ; i<n; i+=16) {
__m128 a = get_a(i);
__m128 b = get_b(i);
// IDK why there isn't an intrinsic for using `pmovzx` as a load, because it takes a m32 or m64 operand, not m128. (unlike punpck*)
__m128i unsigned_dwords = _mm_cvtepu8_epi32( _mm_loadu_si32(buf+i)); // load 4B at once.
// Current GCC has a bug with _mm_loadu_si32, might want to use _mm_load_ss and _mm_castps_si128 until it's fixed.
__m128 floats = _mm_cvtepi32_ps(unsigned_dwords);
floats = _mm_fmadd_ps(floats, a, b); // with FMA available, this might as well be 256b vectors, even with the inconvenience of the different lane-crossing semantics of pmovzx vs. punpck
// or without FMA, do this with _mm_mul_ps and _mm_add_ps
unsigned_dwords = _mm_cvtps_epi32(floats);
// repeat 3 more times for buf+4, buf+8, and buf+12, then:
__m128i packed01 = _mm_packss_epi32(dwords0, dwords1); // SSE2
__m128i packed23 = _mm_packss_epi32(dwords2, dwords3);
// packuswb wants SIGNED input, so do signed saturation on the first step
// saturate into [0..255] range
__m12i8 packedbytes=_mm_packus_epi16(packed01, packed23); // SSE2
_mm_store_si128(buf+i, packedbytes); // or storeu if buf isn't aligned.
}
// cleanup code to handle the odd up-to-15 leftover bytes, if n%16 != 0
(Re: a load that can be a memory source operand for pmovzxbd, see also Loading 8 chars from memory into an __m256 variable as packed single precision floats re: the problems compilers have with this.) And see also GCC bug 99754 - wrong code for _mm_loadu_si32 - reversed vector elements.
The previous version of this answer went from float->uint8 vectors with packusdw/packuswb, and had a whole section on workarounds for without SSE4.1. None of that masking-the-sign-bit after an unsigned pack is needed if you simply stay in the signed integer domain until the last pack. I assume this is the reason SSE2 only included signed pack from dword to word, but both signed and unsigned pack from word to byte. packuswd is only useful if your final goal is uint16_t, rather than further packing.
The last CPU to not have SSE4.1 was Intel Conroe/merom (first gen Core2, from before late 2007), and AMD pre Barcelona (before late 2007). If working-but-slow is acceptable for those CPUs, just write a version for AVX2, and a version for SSE4.1. Or SSSE3 (with 4x pshufb to emulate pmovzxbd of the four 32b elements of a register) pshufb is slow on Conroe, though, so if you care about CPUs without SSE4.1, write a specific version. Actually, Conroe/merom also has slow xmm punpcklbw and so on (except for q->dq). 4x slow pshufb should still beats 6x slow unpacks. Vectorizing is a lot less of a win on pre-Wolfdale, because of the slow shuffles for unpacking and repacking. The fixed point version, with a lot less unpacking/repacking, will have an even bigger advantage there.
See the edit history for an unfinished attempt at using punpck before I realized how many extra instructions it was going to need. Removed it because this answer is long already, and another code block would be confusing.

I guess you're looking fro the __m128 _mm_cvtpi8_ps(__m64 a ) composite intrinsic.
Here is a minimal example:
#include <xmmintrin.h>
#include <stdio.h>
int main() {
unsigned char a[4] __attribute__((aligned(32)))= {1,2,3,4};
float b[4] __attribute__((aligned(32)));
_mm_store_ps(b, _mm_cvtpi8_ps(*(__m64*)a));
printf("%f %f, %f, %f\n", b[0], b[1], b[2], b[3]);
return 0;
}

How to quickly replicate a 6-byte unsigned integer into a memory region?

I need to replicate a 6-byte integer value into a memory region, starting with its beginning and as quickly as possible. If such an operation is supported in hardware, I'd like to use it (I'm on x64 processors now, compiler is GCC 4.6.3).
The memset doesn't suit the job, because it can replicate bytes only. The std::fill isn't good either, because I even can't define an iterator, jumping between 6 byte-width positions in the memory region.
So, I'd like to have a function:
void myMemset(void* ptr, uint64_t value, uint8_t width, size_t num)
This looks like memset, but there is an additional argument width to define how much bytes from the value to replicate. If something like that could be expressed in C++, that would be even better.
I already know about obvious myMemset implementation, which would call the memcpy in loop with last argument (bytes to copy) equal to the width. Also I know, that I can define a temporary memory region with size 6 * 8 = 48 bytes, fill it up with 6-byte integers and then memcpy it to the destination area.
Can we do better?

Something along #Mark Ransom comment:
Copy 6 bytes, then 6, 12, 24, 48, 96, etc.
void memcpy6(void *dest, const void *src, size_t n /* number of 6 byte blocks */) {
if (n-- == 0) {
return;
}
memcpy(dest, src, 6);
size_t width = 1;
while (n >= width) {
memcpy(&((char *) dest)[width * 6], dest, width * 6);
n -= width;
width <<= 1; // double w
}
if (n > 0) {
memcpy(&((char *) dest)[width * 6], dest, n * 6);
}
}
Optimization: scale n and width by 6.
[Edit]
Corrected destination #SchighSchagh
Added cast (char *)

Determine the most efficient write size that the CPU supports; then find the smallest number that can be evenly divided by both 6 and that write size and call that "block size".
Now split the memory region up into blocks of that size. Each block will be identical and all writes will be correctly aligned (assuming the memory region itself is correctly aligned).
For example, if the most efficient write size that the CPU supports is 4 bytes (e.g. ancient 80486) then the "size of block" would be 12 bytes. You'd set 3 general purpose registers and do 3 stores per block.
For another example, if the most efficient write size that the CPU supports is 16 bytes (e.g. SSE) then the "size of block" would be 48 bytes. You'd set 3 SSE registers and do 3 stores per block.
Also, I'd recommend rounding the size of the memory region up to ensure it is a multiple of the block size (with some "not strictly necessary" padding). A few unnecessary writes are less expensive than code to fill a "partial block".
The second most efficient method might be to use a memory copy (but not memcpy() or memmove()). In this case you'd write the initial 6 bytes (or 12 bytes or 48 bytes or whatever), then copy from (e.g.) &area[0] to &area[6] (working from lowest to highest) until you reach the end. For this memmove() will not work because it will notice the area is overlapping and work from highest to lowest instead; and memcpy() will not work because it assumes the source and destination do not overlap; so you'd have to create your own memory copy to suit. The main problem with this is that you double the number of memory accesses - "reading and writing" is slower than "writing alone".

If your Num is large enough, you can try using the AVX vector instructions that will handle 32 bytes at a time (_mm256_load_si256/_mm256_store_si256 or their unaligned variants).
As 32 is not a multiple of 6, you will have to first replicate the 6 bytes pattern 16 times using short memcpy's or 32/64 bits moves.
ABCDEF
ABCDEF|ABCDEF
ABCD EFAB CDEF|ABCD EFAB CDEF
ABCDEFAB CDEFABCD EFABCDEF|ABCDEFAB CDEFABCD EFABCDE
ABCDEFABCDEFABCD EFABCDEFABCDEFAB CDEFABCDEFABCDEF|ABCDEFABCDEFABCD EFABCDEFABCDEFAB CDEFABCDEFABCDEF
You will also finish with a short memcpy.

Try the __movsq intrinsic (x64 only; in assembly, rep movsq) that will move 8 bytes at a time, with a suitable repetition factor, and setting the destination address 6 bytes after the source. Check that overlapping addresses are handled smartly.

Write 8 bytes at a time.
Being on a 64-bit machine, certainly the generated code can operate well with 8-byte writes. After dealing with some set-up issues, in a tight loop, write 8-bytes per write about num times. Assumptions apply - see code.
// assume little endian
void myMemset(void* ptr, uint64_t value, uint8_t width, size_t num) {
assert(width > 0 && width <= 8);
uint64_t *ptr64 = (uint64_t *) ptr;
// # to stop early to prevent writing past array end
static const unsigned stop_early[8 + 1] = { 0, 8, 3, 2, 1, 1, 1, 1, 0 };
size_t se = stop_early[width];
if (num > se) {
num -= se;
// assume no bus-fault with 64-bit write # `ptr64, ptr64+1, ... ptr64+7`
while (num > 0) { // tight loop
num--;
*ptr64 = value;
ptr64 = (uint64_t *) ((char *) ptr64 + width);
}
ptr = ptr64;
num = se;
}
// Cope with last few writes
while (num-- > 0) {
memcpy(ptr, &value, width);
ptr = (char *) ptr + width;
}
}
Further optimization includes writing 2 blocks at a time width == 3 or 4, 4 blocks at a time when width == 2 and 8 blocks at a time width == 1.

What's the fastest way to pack 32 0/1 values into the bits of a single 32-bit variable?

I'm working on an x86 or x86_64 machine. I have an array unsigned int a[32] all of whose elements have value either 0 or 1. I want to set the single variable unsigned int b so that (b >> i) & 1 == a[i] will hold for all 32 elements of a. I'm working with GCC on Linux (shouldn't matter much I guess).
What's the fastest way to do this in C?

The fastest way on recent x86 processors is probably to make use of the MOVMSKB family of instructions which extract the MSBs of a SIMD word and pack them into a normal integer register.
I fear SIMD intrinsics are not really my thing but something along these lines ought to work if you've got an AVX2 equipped processor:
uint32_t bitpack(const bool array[32]) {
__mm256i tmp = _mm256_loadu_si256((const __mm256i *) array);
tmp = _mm256_cmpgt_epi8(tmp, _mm256_setzero_si256());
return _mm256_movemask_epi8(tmp);
}
Assuming sizeof(bool) = 1. For older SSE2 systems you will have to string together a pair of 128-bit operations instead. Aligning the array on a 32-byte boundary and should save another cycle or so.

If sizeof(bool) == 1 then you can pack 8 bools at a time into 8 bits (more with 128-bit multiplications) using the technique discussed here in a computer with fast multiplication like this
inline int pack8b(bool* a)
{
uint64_t t = *((uint64_t*)a);
return (0x8040201008040201*t >> 56) & 0xFF;
}
int pack32b(bool* a)
{
return (pack8b(a + 0) << 24) | (pack8b(a + 8) << 16) |
(pack8b(a + 16) << 8) | (pack8b(a + 24) << 0);
}
Explanation:
Suppose the bools a[0] to a[7] have their least significant bits named a-h respectively. Treating those 8 consecutive bools as one 64-bit word and load them we'll get the bits in reversed order in a little-endian machine. Now we'll do a multiplication (here dots are zero bits)
| a7 || a6 || a4 || a4 || a3 || a2 || a1 || a0 |
.......h.......g.......f.......e.......d.......c.......b.......a
× 1000000001000000001000000001000000001000000001000000001000000001
────────────────────────────────────────────────────────────────
↑......h.↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
↑....f...↑...e....↑..d.....↑.c......↑b.......a
+ ↑...e....↑..d.....↑.c......↑b.......a
↑..d.....↑.c......↑b.......a
↑.c......↑b.......a
↑b.......a
a
────────────────────────────────────────────────────────────────
= abcdefghxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The arrows are added so it's easier to see the position of the set bits in the magic number. At this point 8 least significant bits has been put in the top byte, we'll just need to mask the remaining bits out
So by using the magic number 0b1000000001000000001000000001000000001000000001000000001000000001 or 0x8040201008040201 we have the above code
Of course you need to make sure that the bool array is correctly 8-byte aligned. You can also unroll the code and optimize it, like shift only once instead of shifting left 56 bits
Sorry I overlooked the question and saw doynax's bool array as well as misread "32 0/1 values" and thought they're 32 bools. Of course the same technique can also be used to pack multiple uint32_t or uint16_t values (or other distribution of bits) at the same time but it's a lot less efficient than packing bytes
On newer x86 CPUs with BMI2 the PEXT instruction can be used. The pack8b function above can be replaced with
_pext_u64(*((uint64_t*)a), 0x0101010101010101ULL);
And to pack 2 uint32_t as the question requires use
_pext_u64(*((uint64_t*)a), (1ULL << 32) | 1ULL);

Other answers contain an obvious loop implementation.
Here's a first variant:
unsigned int result=0;
for(unsigned i = 0; i < 32; ++i)
result = (result<<1) + a[i];
On modern x86 CPUs, I think shifts of any distance in a register is constant, and this solution won't be better. Your CPU might not be so nice; this code minimizes the cost of long-distance shifts; it does 32 1-bit shifts which every CPU can do (you can always add result to itself to get the same effect). The obvious loop implementation shown by others does about 900 (sum on 32) 1-bit shifts, by virtue of shifting a distance equal to the loop index. (See #Jongware's measurements of differences in comments; apparantly long shifts on x86 are not unit time).
Let us try something more radical.
Assume you can pack m booleans into an int somehow (trivially you can do this for m==1), and that you have two instance variables i1 and i2 containing such m packed bits.
Then the following code packs m*2 booleans into an int:
(i1<<m+i2)
Using this we can pack 2^n bits as follows:
unsigned int a2[16],a4[8],a8[4],a16[2], a32[1]; // each "aN" will hold N bits of the answer
a2[0]=(a1[0]<<1)+a2[1]; // the original bits are a1[k]; can be scalar variables or ints
a2[1]=(a1[2]<<1)+a1[3]; // yes, you can use "|" instead of "+"
...
a2[15]=(a1[30]<<1)+a1[31];
a4[0]=(a2[0]<<2)+a2[1];
a4[1]=(a2[2]<<2)+a2[3];
...
a4[7]=(a2[14]<<2)+a2[15];
a8[0]=(a4[0]<<4)+a4[1];
a8[1]=(a4[2]<<4)+a4[3];
a8[1]=(a4[4]<<4)+a4[5];
a8[1]=(a4[6]<<4)+a4[7];
a16[0]=(a8[0]<<8)+a8[1]);
a16[1]=(a8[2]<<8)+a8[3]);
a32[0]=(a16[0]<<16)+a16[1];
Assuming our friendly compiler resolves an[k] into a (scalar) direct memory access (if not, you can simply replace the variable an[k] with an_k), the above code does (abstractly) 63 fetches, 31 writes, 31 shifts and 31 adds. (There's an obvious extension to 64 bits).
On modern x86 CPUs, I think shifts of any distance in a register is constant. If not, this code minimizes the cost of long-distance shifts; it in effect does 64 1-bit shifts.
On an x64 machine, other than the fetches of the original booleans a1[k], I'd expect all the rest of the scalars to be schedulable by the compiler to fit in the registers, thus 32 memory fetches, 31 shifts and 31 adds. Its pretty hard to avoid the fetches (if the original booleans are scattered around) and the shifts/adds match the obvious simple loop. But there is no loop, so we avoid 32 increment/compare/index operations.
If the starting booleans are really in array, with each bit occupying the bottom bit of and otherwise zeroed byte:
bool a1[32];
then we can abuse our knowledge of memory layout to fetch several at a time:
a4[0]=((unsigned int)a1)[0]; // picks up 4 bools in one fetch
a4[1]=((unsigned int)a1)[1];
...
a4[7]=((unsigned int)a1)[7];
a8[0]=(a4[0]<<1)+a4[1];
a8[1]=(a4[2]<<1)+a4[3];
a8[2]=(a4[4]<<1)+a4[5];
a8[3]=(a8[6]<<1)+a4[7];
a16[0]=(a8[0]<<2)+a8[1];
a16[0]=(a8[2]<<2)+a8[3];
a32[0]=(a16[0]<<4)+a16[1];
Here our cost is 8 fetches of (sets of 4) booleans, 7 shifts and 7 adds. Again, no loop overhead. (Again there is an obvious generalization to 64 bits).
To get faster than this, you probably have to drop into assembler and use some of the many wonderful and wierd instrucions available there (the vector registers probably have scatter/gather ops that might work nicely).
As always, these solutions needed to performance tested.

I would probably go for this:
unsigned a[32] =
{
1, 0, 0, 1, 1, 1, 0 ,0, 1, 0, 0, 0, 1, 1, 0, 0
, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1
};
int main()
{
unsigned b = 0;
for(unsigned i = 0; i < sizeof(a) / sizeof(*a); ++i)
b |= a[i] << i;
printf("b: %u\n", b);
}
Compiler optimization may well unroll that but just in case you can always try:
int main()
{
unsigned b = 0;
b |= a[0];
b |= a[1] << 1;
b |= a[2] << 2;
b |= a[3] << 3;
// ... etc
b |= a[31] << 31;
printf("b: %u\n", b);
}

To determine what the fastest way is, time all of the various suggestions. Here is one that well may end up as "the" fastest (using standard C, no processor dependent SSE or the likes):
unsigned int bits[32][2] = {
{0,0x80000000},{0,0x40000000},{0,0x20000000},{0,0x10000000},
{0,0x8000000},{0,0x4000000},{0,0x2000000},{0,0x1000000},
{0,0x800000},{0,0x400000},{0,0x200000},{0,0x100000},
{0,0x80000},{0,0x40000},{0,0x20000},{0,0x10000},
{0,0x8000},{0,0x4000},{0,0x2000},{0,0x1000},
{0,0x800},{0,0x400},{0,0x200},{0,0x100},
{0,0x80},{0,0x40},{0,0x20},{0,0x10},
{0,8},{0,4},{0,2},{0,1}
};
unsigned int b = 0;
for (i=0; i< 32; i++)
b |= bits[i][a[i]];
The first value in the array is to be the leftmost bit: the highest possible value.
Testing proof-of-concept with some rough timings show this is indeed not magnitudes better than the straightforward loop with b |= (a[i]<<(31-i)):
Ira 3618 ticks
naive, unrolled 5620 ticks
Ira, 1-shifted 10044 ticks
Galik 10265 ticks
Jongware, using adds 12536 ticks
Jongware 12682 ticks
naive 13373 ticks
(Relative timings, with the same compiler options.)
(The 'adds' routine is mine with indexing replaced with a pointer-to and an explicit add for both indexed arrays. It is 10% slower, meaning my compiler is efficiently optimizing indexed access. Good to know.)

unsigned b=0;
for(int i=31; i>=0; --i){
b<<=1;
b|=a[i];
}

Your problem is a good opportunity to use -->, also called the downto operator:
unsigned int a[32];
unsigned int b = 0;
for (unsigned int i = 32; i --> 0;) {
b += b + a[i];
}
The advantage of using --> is it works with both signed and unsigned loop index types.
This approach is portable and readable, it might not produce the fastest code, but clang does unroll the loop and produce decent performance, see https://godbolt.org/g/6xgwLJ

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js