TL;DR: How can I safely perform a single-bit update A[n/8] |= (1<<n%8); where A is a huge array of chars (i.e., setting the n-th bit of A to true), when computing in parallel using C++11's <thread> library?
I'm performing a computation that's easy to parallelize. I'm computing elements of a certain subset of the natural numbers, and I want to find the elements that are not in the subset. For this I create a huge array (like A = new char[20l*1024l*1024l*1024l], i.e., 20 GiB). The n-th bit of this array is true if n lies in my set.
When doing it in parallel and setting the bits to true using A[n/8] |= (1<<n%8);, I seem to get a small loss of information, presumably due to concurrent work on the same byte of A (each thread has to first read the byte, update the single bit and write the byte back). How can I get around this? Is there a way to do this update as an atomic operation?
The code follows. GCC version: g++ (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609. The machine is an 8-core Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 37GB RAM. Compiler options: g++ -std=c++11 -pthread -O3
#include <iostream>
#include <thread>
typedef long long myint; // long long to be sure
const myint max_A = 20ll*1024ll*1024ll; // 20 MiB for testing
//const myint max_A = 20ll*1024ll*1024ll*1024ll; // 20 GiB in the real code
const myint n_threads = 1; // Number of threads
const myint prime = 1543; // Tested prime
char *A;
const myint max_n = 8*max_A;
inline char getA(myint n) { return A[n/8] & (1<<(n%8)); }
inline void setAtrue(myint n) { A[n/8] |= (1<<n%8); }
void run_thread(myint startpoint) {
// Calculate all values of x^2 + 2y^2 + prime*z^2 up to max_n
// We loop through x == startpoint (mod n_threads)
for(myint x = startpoint; 1*x*x < max_n; x+=n_threads)
for(myint y = 0; 1*x*x + 2*y*y < max_n; y++)
for(myint z = 0; 1*x*x + 2*y*y + prime*z*z < max_n; z++)
setAtrue(1*x*x + 2*y*y + prime*z*z);
}
int main() {
myint n;
// Only n_threads-1 threads, as we will use the master thread as well
std::thread T[n_threads-1];
// Initialize the array
A = new char[max_A]();
// Start the threads
for(n = 0; n < n_threads-1; n++) T[n] = std::thread(run_thread, n);
// We use also the master thread
run_thread(n_threads-1);
// Synchronize
for(n = 0; n < n_threads-1; n++) T[n].join();
// Print and count all elements not in the set and n != 0 (mod prime)
myint cnt = 0;
for(n=0; n<max_n; n++) if(( !getA(n) )&&( n%1543 != 0 )) {
std::cout << n << std::endl;
cnt++;
}
std::cout << "cnt = " << cnt << std::endl;
return 0;
}
When n_threads = 1, I get the correct value cnt = 29289. When n_threads = 7, I got cnt = 29314 and cnt = 29321 on two different runs, suggesting that some of the read-modify-write operations on a single byte were overlapping.
std::atomic provides all the facilities that you need here:
#include <array>
#include <atomic>
std::array<std::atomic<char>, max_A> A;
static_assert(sizeof(A[0]) == 1, "Shall not have memory overhead");
// Note: is_always_lock_free requires C++17; omit this assert when building with -std=c++11
static_assert(std::atomic<char>::is_always_lock_free,
"No software-level locking needed on common platforms");
inline char getA(myint n) { return A[n / 8] & (1 << (n % 8)); }
inline void setAtrue(myint n) { A[n / 8].fetch_or(1 << (n % 8)); }
The load in getA is atomic (equivalent to load()), and std::atomic even has built-in support for OR-ing the stored value with another one (fetch_or), atomically of course.
When initializing A, the naive way of for (auto& a : A) a = 0; would require synchronization after every store, which you can avoid by waiving some thread-safety. std::memory_order_release only requires that what we write is visible to other threads (but not that other threads' writes are visible to us). And indeed, if you do
// Initialize the array
for (auto& a : A)
a.store(0, std::memory_order_release);
you get the safety you need without any assembly-level synchronization on x86. You could do the reverse for the loads after the threads finish, but that has no added benefit on x86 (it's just a mov either way).
Demo on the full code: https://godbolt.org/z/nLPlv1
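For the 20 GiB case, where the array is allocated with new rather than declared statically, the same fetch_or idea can be sketched with a heap-allocated array. This is a minimal sketch, not the code behind the demo link: the class name AtomicBitset and its members are hypothetical, and memory_order_relaxed is used on the assumption that every worker thread is joined before any bit is read back, as in the question's main().

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical helper (not from the original answer): a heap-allocated bitset
// whose bits can be set from several threads at once.
class AtomicBitset {
public:
    explicit AtomicBitset(std::size_t n_bits)
        : n_bits_(n_bits),
          bytes_(new std::atomic<unsigned char>[(n_bits + 7) / 8]()) {
        // Explicitly zero every byte; relaxed is enough before threads start.
        for (std::size_t i = 0; i < (n_bits + 7) / 8; ++i)
            bytes_[i].store(0, std::memory_order_relaxed);
    }

    // Atomic read-modify-write: concurrent updates to the same byte are not lost.
    void set(std::size_t n) {
        assert(n < n_bits_);
        bytes_[n / 8].fetch_or(static_cast<unsigned char>(1u << (n % 8)),
                               std::memory_order_relaxed);
    }

    // Plain atomic load of the containing byte, then extract the bit.
    bool test(std::size_t n) const {
        assert(n < n_bits_);
        return ((bytes_[n / 8].load(std::memory_order_relaxed) >> (n % 8)) & 1u) != 0;
    }

private:
    std::size_t n_bits_;
    std::unique_ptr<std::atomic<unsigned char>[]> bytes_;
};
```

In the question's program, each worker would call set(n) where it currently calls setAtrue(n), and the counting loop after join() would call test(n) in place of getA(n).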
My task is to design a function that fulfils those requirements:
Function shall sum members of given one-dimensional array. However, it should sum only members whose number of ones in the binary representation is higher than defined threshold (e.g. if the threshold is 4, number 255 will be counted and 15 will not)
The array length is arbitrary
The function shall utilize as little memory as possible and shall be written in an efficient way
The production function code (‘sum_filtered(){..}’) shall not use any standard C library functions (or any other libraries)
The function shall return 0 on success and error code on error
The array elements are of a type 16-bit signed integer and an overflow during calculation shall be regarded as a failure
Use data types that ensure portability between different CPUs (so the calculations will be the same on 8/16/32-bit MCU)
The function code should contain a reasonable amount of comments in doxygen annotation
Here is my solution:
#include <iostream>
using namespace std;
int sum_filtered(short array[], int treshold)
{
// return 1 if invalid input parameters
if((treshold < 0) || (treshold > 16)){return(1);}
int sum = 0;
int bitcnt = 0;
for(int i=0; i < sizeof(array); i++)
{
// Count one bits of integer
bitcnt = 0;
for (int pos = 0 ; pos < 16 ; pos++) {if (array[i] & (1 << pos)) {bitcnt++;}}
// Add integer to sum if bitcnt>treshold
if(bitcnt>treshold){sum += array[i];}
}
return(0);
}
int main()
{
short array[5] = {15, 2652, 14, 1562, -115324};
int result = sum_filtered(array, 14);
cout << result << endl;
short array2[5] = {15, 2652, 14, 1562, 15324};
result = sum_filtered(array2, -2);
cout << result << endl;
}
However I'm not sure whether this code is portable between different CPUs.
And I don't know how an overflow can occur during the calculation, or what other errors can arise while processing arrays with this function.
Can somebody more experienced give me their opinion?
Well, I can foresee one problem:
for(int i=0; i < sizeof(array); i++)
array in this context is a pointer, so sizeof(array) will likely be 4 on 32-bit systems or 8 on 64-bit systems, regardless of how many elements the caller's array holds. You really do want to pass a count (in this case 5) into the sum_filtered function; the caller, where array is still a real array, can compute that count as sizeof(array) / sizeof(short).
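A small sketch of the difference, under the assumption that the caller still holds a real array: element_count and sum_all are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: recovers the element count of a real array at compile
// time. It refuses to compile when handed a pointer, which is exactly the
// case where sizeof(array) would silently give the wrong answer.
template <typename T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) { return N; }

// The callee receives the length explicitly; 'values' is just a pointer here.
long sum_all(const short* values, std::size_t length) {
    long sum = 0;
    for (std::size_t i = 0; i < length; ++i) sum += values[i];
    return sum;
}
```

A caller would write sum_all(data, element_count(data)) or, equivalently, pass sizeof(data) / sizeof(data[0]).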
Anyhow, this code:
// Count one bits of integer
bitcnt = 0;
for (int pos = 0 ; pos < 16 ; pos++) {if (array[i] & (1 << pos)) {bitcnt++;}}
Effectively you are doing a popcount here, which can be done using __builtin_popcount on GCC/Clang or __popcnt on MSVC. These are compiler-specific, but on CPUs that have a population-count instruction they usually boil down to that single instruction.
If you do want to do this the slow way, then an efficient approach is to treat the computation as a form of bitwise SIMD operation:
#include <cstdint> // or stdint.h if you have a rubbish compiler :)
uint16_t popcount(uint16_t s)
{
// perform 8x 1bit adds
uint16_t a0 = s & 0x5555;
uint16_t b0 = (s >> 1) & 0x5555;
uint16_t s0 = a0 + b0;
// perform 4x 2bit adds
uint16_t a1 = s0 & 0x3333;
uint16_t b1 = (s0 >> 2) & 0x3333;
uint16_t s1 = a1 + b1;
// perform 2x 4bit adds
uint16_t a2 = s1 & 0x0F0F;
uint16_t b2 = (s1 >> 4) & 0x0F0F;
uint16_t s2 = a2 + b2;
// perform 1x 8bit adds
uint16_t a3 = s2 & 0x00FF;
uint16_t b3 = (s2 >> 8) & 0x00FF;
return a3 + b3;
}
I know it says you can't use stdlib functions (your 4th point), but that shouldn't apply to the standardised integer types surely? (e.g. uint16_t) If it does, well then there is no way to guarantee portability across platforms. You're out of luck.
Personally I'd just use a 64-bit integer for the sum. That should make overflow a practical non-issue: even if the threshold is zero and every value is -32768 (the most negative int16_t), the sum would only overflow once the array exceeded 0x1000000000000 elements (2^48, i.e. 281,474,976,710,656 in decimal).
#include <cstdint>
int64_t sum_filtered(int16_t array[], uint16_t threshold, size_t array_length)
{
// changing the type on threshold to be unsigned means we don't need to test
// for negative numbers.
if(threshold > 16) { return 1; }
int64_t sum = 0;
for(size_t i=0; i < array_length; i++)
{
if (popcount(array[i]) > threshold)
{
sum += array[i];
}
}
return sum;
}
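The function above returns the sum itself, so a caller cannot tell an error code of 1 from a sum of 1. One way to satisfy the "return 0 on success and error code on error" requirement is sketched below. It rests on assumptions that the task statement does not spell out: the result is delivered through an out-parameter, "overflow" means that a running sum leaves the int16_t range, and the error code values and helper names are hypothetical. Only fixed-width types and limit macros come from a header; no library function is called.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Error codes are hypothetical; the requirements only say "0 on success".
enum { SUM_OK = 0, SUM_ERR_ARGUMENT = 1, SUM_ERR_OVERFLOW = 2 };

/** @brief Count the one bits of a 16-bit value, clearing the lowest set bit per step. */
static std::uint8_t count_ones16(std::uint16_t value) {
    std::uint8_t ones = 0;
    while (value != 0u) {
        value = static_cast<std::uint16_t>(value & (value - 1u));
        ++ones;
    }
    return ones;
}

/**
 * @brief Sum the elements whose number of one bits exceeds @p threshold.
 * @param[in]  data      Input array, must not be null.
 * @param[in]  length    Number of elements in @p data.
 * @param[in]  threshold Bit-count threshold, 0..16.
 * @param[out] result    Receives the sum on success, untouched on error.
 * @return SUM_OK, SUM_ERR_ARGUMENT or SUM_ERR_OVERFLOW.
 */
int sum_filtered(const std::int16_t* data, std::size_t length,
                 std::uint8_t threshold, std::int16_t* result) {
    if (data == nullptr || result == nullptr || threshold > 16u) {
        return SUM_ERR_ARGUMENT;
    }
    std::int32_t sum = 0; // wide enough to hold int16 + int16 before the range check
    for (std::size_t i = 0; i < length; ++i) {
        if (count_ones16(static_cast<std::uint16_t>(data[i])) > threshold) {
            sum += data[i];
            if (sum > INT16_MAX || sum < INT16_MIN) {
                return SUM_ERR_OVERFLOW; // assumption: any out-of-range partial sum is a failure
            }
        }
    }
    *result = static_cast<std::int16_t>(sum);
    return SUM_OK;
}
```

With the input {15, 255, 1023, -1} and a threshold of 4, the value 15 is skipped (exactly four one bits) and the call stores 1277 in the result.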
Input is a bitarray stored in contiguous memory with 1 bit of the bitarray per 1 bit of memory.
Output is an array of the indices of set bits of the bitarray.
Example:
bitarray: 0000 1111 0101 1010
setA: {4,5,6,7,9,11,12,14}
setB: {2,4,5,7,9,10,11,12}
Getting either set A or set B is fine.
The set is stored as an array of uint32_t so each element of the set is an unsigned 32 bit integer in the array.
How to do this about 5 times faster on a single cpu core?
current code:
#include <iostream>
#include <vector>
#include <time.h>
using namespace std;
template <typename T>
uint32_t bitarray2set(T& v, uint32_t * ptr_set){
uint32_t i;
uint32_t base = 0;
uint32_t * ptr_set_new = ptr_set;
uint32_t size = v.capacity();
for(i = 0; i < size; i++){
find_set_bit(v[i], ptr_set_new, base);
base += 8*sizeof(uint32_t);
}
return (ptr_set_new - ptr_set);
}
inline void find_set_bit(uint32_t n, uint32_t*& ptr_set, uint32_t base){
// Find the set bits in a uint32_t
int k = base;
while(n){
if (n & 1){
*(ptr_set) = k;
ptr_set++;
}
n = n >> 1;
k++;
}
}
template <typename T>
void rand_vector(T& v){
srand(time(NULL));
int i;
int size = v.capacity();
for (i=0;i<size;i++){
v[i] = rand();
}
}
template <typename T>
void print_vector(T& v, int size_in = 0){
int i;
int size;
if (size_in == 0){
size = v.capacity();
} else {
size = size_in;
}
for (i=0;i<size;i++){
cout << v[i] << ' ';
}
cout << endl;
}
int main(void){
const int test_size = 6000;
vector<uint32_t> vec(test_size);
vector<uint32_t> set(test_size*sizeof(uint32_t)*8);
rand_vector(vec);
//for (int i; i < 64; i++) vec[i] = -1;
//cout << "input" << endl;
print_vector(vec);
//cout << "calculate result" << endl;
int i;
int rep = 10000;
uint32_t res_size;
struct timespec tp_start, tp_end;
clock_gettime(CLOCK_MONOTONIC, &tp_start);
for (i=0;i<rep;i++){
res_size = bitarray2set(vec, set.data());
}
clock_gettime(CLOCK_MONOTONIC, &tp_end);
double timing;
const double nano = 0.000000001;
timing = ((double)(tp_end.tv_sec - tp_start.tv_sec )
+ (tp_end.tv_nsec - tp_start.tv_nsec) * nano) /(rep);
cout << "timing per cycle: " << timing << endl;
cout << "print result" << endl;
//print_vector(set, res_size);
}
result (compiled with icc -O3 code.cpp -lrt)
...
timing per cycle: 0.000739613 (7.4E-4).
print result
0.0008 seconds to convert 768000 bits to set. But there are at least 10,000 arrays of 768,000 bits in each cycle. That is 8 seconds per cycle. That is slow.
The cpu has popcnt instruction and sse4.2 instruction set.
Thanks.
Update
template <typename T>
uint32_t bitarray2set(T& v, uint32_t * ptr_set){
uint32_t i;
uint32_t base = 0;
uint32_t * ptr_set_new = ptr_set;
uint32_t size = v.capacity();
uint32_t * ptr_v;
uint32_t * ptr_v_end = &(v[size]);
for(ptr_v = v.data(); ptr_v < ptr_v_end; ++ptr_v){
while(*ptr_v) {
*ptr_set_new++ = base + __builtin_ctz(*ptr_v);
(*ptr_v) &= (*ptr_v) - 1; // zeros the lowest 1-bit in n
}
base += 8*sizeof(uint32_t);
}
return (ptr_set_new - ptr_set);
}
This updated version uses the inner loop provided by rhashimoto. I don't know whether the inlining actually makes the function slower (I never thought that could happen!). The new timing is 1.14E-5 (compiled with icc -O3 code.cpp -lrt and benchmarked on a random vector).
Warning:
I just found that reserving instead of resizing a std::vector, and then writing directly to the vector's data through a raw pointer, is a bad idea. Resizing first and then using a raw pointer is fine, though. See Robᵩ's answer at "Resizing a C++ std::vector<char> without initializing data". I am going to just use resize instead of reserve and stop worrying about the time that resize wastes by calling the constructor of each element of the vector. At least vectors actually use contiguous memory, like a plain array (see "Are std::vector elements guaranteed to be contiguous?").
I notice that you use .capacity() when you probably mean to use .size(). That could make you do extra unnecessary work, as well as giving you the wrong answer.
Your loop in find_set_bit() tests every bit position up to the highest set bit in the word. You can instead iterate only over the set bits and use the BSF instruction to determine the index of the lowest one. GCC has an intrinsic function __builtin_ctz() to generate BSF or the equivalent; I think that the Intel compiler also supports it (you can use inline assembly if not). The modified function would look like this:
inline void find_set_bit(uint32_t n, uint32_t*& ptr_set, uint32_t base){
// Find the set bits in a uint32_t
while(n) {
*ptr_set++ = base + __builtin_ctz(n);
n &= n - 1; // zeros the lowest 1-bit in n
}
}
On my Linux machine, compiling with g++ -O3, replacing that function drops the reported time from 0.000531434 to 0.000101352.
There are quite a few ways to find a bit index in the answers to this question. I do think that __builtin_ctz() is going to be the best choice for you, though. I don't believe that there is a reasonable SIMD approach to your problem, as each input word produces a variable amount of output.
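Putting the two remarks together (size() instead of capacity(), and the ctz-based inner loop) gives something like the sketch below. The function name bitarray_to_set is hypothetical, and __builtin_ctz is assumed to be available (GCC/Clang). Note that the loop clears bits in a local copy of each word. The "Update" version in the question clears them in the input vector itself, so a benchmark that calls it repeatedly would process an all-zero vector after the first call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Writes the index of every set bit in 'bits' to 'out' and returns how many
// indices were written. 'out' must have room for 32 * bits.size() entries.
// __builtin_ctz is a GCC/Clang builtin (assumed available here).
std::size_t bitarray_to_set(const std::vector<std::uint32_t>& bits,
                            std::uint32_t* out) {
    std::uint32_t* const out_start = out;
    std::uint32_t base = 0;
    for (std::size_t i = 0; i < bits.size(); ++i) { // size(), not capacity()
        std::uint32_t word = bits[i];               // local copy: input stays intact
        while (word != 0) {
            *out++ = base + static_cast<std::uint32_t>(__builtin_ctz(word));
            word &= word - 1;                       // clear the lowest set bit
        }
        base += 32;
    }
    return static_cast<std::size_t>(out - out_start);
}
```

Bit 0 is taken to be the least significant bit of the first word, which corresponds to one of the two orderings the question says are acceptable.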
As suggested by @davidbak, you could use a table lookup to process 4 elements of the bitmap at once.
Each lookup produces a variable-sized chunk of set members, which we can handle by using popcnt.
@rhashimoto's scalar ctz-based suggestion will probably do better with sparse bitsets that have lots of zeros, but this should be better when there are a lot of set bits.
I'm thinking something like
#include <immintrin.h>
// a vector of 4 elements for every pattern of 4 bits.
// values range from 0 to 3, and will have a multiple of 4 added to them.
alignas(16) static const int LUT[16*4] = { 0,0,0,0, ... };
// mostly C, some pseudocode.
unsigned int bitmap2set(int *set, const unsigned *input) {
    int *set_start = set;
    __m128i offset = _mm_setzero_si128();
    for (nibble in input[]) { // pseudocode for the actual shifting / masking
        __m128i v = _mm_load_si128((const __m128i*)&LUT[4 * nibble]); // 4 ints per pattern
        __m128i vpos = _mm_add_epi32(v, offset);
        _mm_storeu_si128((__m128i*)set, vpos); // unaligned: set advances by 0..4 ints
        set += _mm_popcnt_u32(nibble); // variable-length store
        offset = _mm_add_epi32(offset, _mm_set1_epi32(4)); // increment the offset by 4
    }
    return set - set_start; // set size
}
When a nibble isn't 1111, the next store will overlap, but that's fine.
Using popcnt to figure out how much to increment a pointer is a useful technique in general for left-packing variable-length data into a destination array.
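The left-packing idea can be tried out without SIMD. The sketch below is a scalar stand-in for the vector version, not a translation of it: the 16-byte store is replaced by a memcpy of four elements, the table is built at start-up rather than written out, and the names NibbleLut and bitmap_to_set_lut are hypothetical. __builtin_popcount is assumed to be available (GCC/Clang). Because every step writes four elements, the output buffer needs three spare slots beyond the largest possible result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical table: for each 4-bit pattern, the positions (0..3) of its set
// bits, packed to the left. Unused slots hold 0 and are later overwritten.
struct NibbleLut {
    std::uint32_t pos[16][4];
    NibbleLut() {
        for (unsigned p = 0; p < 16; ++p) {
            unsigned k = 0;
            for (unsigned b = 0; b < 4; ++b)
                if (p & (1u << b)) pos[p][k++] = b;
            while (k < 4) pos[p][k++] = 0;
        }
    }
};

// 'out' must have room for 32 * bitmap.size() + 3 elements.
std::size_t bitmap_to_set_lut(const std::vector<std::uint32_t>& bitmap,
                              std::uint32_t* out) {
    static const NibbleLut lut;
    std::uint32_t* const start = out;
    std::uint32_t base = 0;
    for (std::uint32_t word : bitmap) {
        for (unsigned shift = 0; shift < 32; shift += 4, base += 4) {
            const unsigned nibble = (word >> shift) & 0xFu;
            std::uint32_t chunk[4];
            for (int k = 0; k < 4; ++k) chunk[k] = lut.pos[nibble][k] + base;
            std::memcpy(out, chunk, sizeof chunk); // always store 4 elements
            out += __builtin_popcount(nibble);     // but keep only the valid ones
        }
    }
    return static_cast<std::size_t>(out - start);
}
```

Elements written past the returned size are garbage from the overlapping stores and should be ignored by the caller.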
I'd like to generate all possible combinations (without repetitions) in bit representation. I can't use any library like Boost or stl::next_combination; it has to be my own code (computation time is very important).
Here's my code (modified from a Stack Overflow user's):
int combination = (1 << k) - 1;
int new_combination = 0;
int change = 0;
while (true)
{
// return next combination
cout << combination << endl;
// find first index to update
int indexToUpdate = k;
while (indexToUpdate > 0 && GetBitPositionByNr(combination, indexToUpdate)>= n - k + indexToUpdate)
indexToUpdate--;
if (indexToUpdate == 1) change = 1; // move all bits to the left by one position
if (indexToUpdate <= 0) break; // done
// update combination indices
new_combination = 0;
for (int combIndex = GetBitPositionByNr(combination, indexToUpdate) - 1; indexToUpdate <= k; indexToUpdate++, combIndex++)
{
if(change)
{
new_combination |= (1 << (combIndex + 1));
}
else
{
combination = combination & (~(1 << combIndex));
combination |= (1 << (combIndex + 1));
}
}
if(change) combination = new_combination;
change = 0;
}
where n - all elements, k - number of elements in combination.
GetBitPositionByNr - return position of k-th bit.
GetBitPositionByNr(13,2) = 3 cause 13 is 1101 and second bit is on third position.
It gives me correct output for n=4, k=2 which is:
0011 (3 - decimal representation - printed value)
0101 (5)
1001 (9)
0110 (6)
1010 (10)
1100 (12)
It also gives me correct output for k=1 and k=4, but gives me wrong output for k=3, which is:
0111 (7)
1011 (11)
1011 (9) - wrong, should be 13
1110 (14)
I guess the problem is in the inner while condition (the second one), but I don't know how to fix this.
Maybe some of you know a better (faster) algorithm to do what I want to achieve? It can't use additional memory (arrays).
Here is code to run on ideone: IDEONE
When in doubt, use brute force. That is, generate all variations with repetition (every N-bit pattern), then filter out the unnecessary patterns:
unsigned bit_count(unsigned n)
{
unsigned i = 0;
while (n) {
i += n & 1;
n >>= 1;
}
return i;
}
int main()
{
std::vector<unsigned> combs;
const unsigned N = 4;
const unsigned K = 3;
for (int i = 0; i < (1 << N); i++) {
if (bit_count(i) == K) {
combs.push_back(i);
}
}
// and print 'combs' here
}
Edit: Someone else already pointed out a solution without filtering and brute force, but I'm still going to give you a few hints about this algorithm:
Most compilers offer some sort of intrinsic population count function. I know of GCC and Clang, which have __builtin_popcount(). Using this intrinsic function, I was able to double the speed of the code.
Since you seem to be working on GPUs, you can parallelize the code. I have done it using C++11's standard threading facilities, and I've managed to compute all 32-bit repetitions for arbitrarily-chosen popcounts 1, 16 and 19 in 7.1 seconds on my 8-core Intel machine.
Here's the final code I've written:
#include <vector>
#include <cstdio>
#include <thread>
#include <utility>
#include <future>
unsigned popcount_range(unsigned popcount, unsigned long min, unsigned long max)
{
unsigned n = 0;
for (unsigned long i = min; i < max; i++) {
n += __builtin_popcount(i) == popcount;
}
return n;
}
int main()
{
const unsigned N = 32;
const unsigned K = 16;
const unsigned N_cores = 8;
const unsigned long Max = 1ul << N;
const unsigned long N_per_core = Max / N_cores;
std::vector<std::future<unsigned>> v;
for (unsigned core = 0; core < N_cores; core++) {
unsigned long core_min = N_per_core * core;
unsigned long core_max = core_min + N_per_core;
auto fut = std::async(
std::launch::async,
popcount_range,
K,
core_min,
core_max
);
v.push_back(std::move(fut));
}
unsigned final_count = 0;
for (auto &fut : v) {
final_count += fut.get();
}
printf("%u\n", final_count);
return 0;
}
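The edit above mentions that a solution without filtering exists but does not show it. One well-known technique of that kind is Gosper's hack, which steps directly from one k-bit pattern to the next larger one. The sketch below is an illustration of that technique, not necessarily what the other answer used. The function names are hypothetical, unsigned is assumed to be 32 bits wide, and the std::vector is there only to make the result easy to inspect (the stepping function itself needs no extra memory).

```cpp
#include <cassert>
#include <vector>

// Next larger integer with the same number of set bits (Gosper's hack).
// Precondition: x != 0.
unsigned next_same_popcount(unsigned x) {
    unsigned lowest = x & (~x + 1u); // isolate the lowest set bit
    unsigned ripple = x + lowest;    // carry out of the lowest run of ones
    // Keep the carried bit and move the rest of that run down to bit 0.
    return ripple | (((x ^ ripple) >> 2) / lowest);
}

// Collects all k-of-n combinations as bit masks, in increasing numeric order.
// Assumes 1 <= k <= n <= 31 so nothing overflows a 32-bit unsigned.
std::vector<unsigned> combinations(unsigned n, unsigned k) {
    std::vector<unsigned> result;
    const unsigned limit = 1u << n;
    for (unsigned c = (1u << k) - 1u; c < limit; c = next_same_popcount(c))
        result.push_back(c);
    return result;
}
```

For n=4 and k=3 this visits 7, 11, 13, 14, which is the sequence the question expects. For k=2 the patterns come out in numeric order (3, 5, 6, 9, 10, 12) rather than in the order the question's code prints them.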
The following code, a parallel comparison, takes forever to run when the number of keys in the map is huge (e.g. 100000) and each key's second element has a huge number of elements (e.g. 100000) as well.
Is there any possible way to speed up the comparison? My CPU is a Xeon E5450 3.00 GHz with 4 cores. RAM is sufficient.
// There is a map with long as its key and vector<long> as second element,
// the vector's elements are increasing sorted.
map<long, vector<long> > aMap; // filled elsewhere
map<long, vector<long> >::iterator it1 = aMap.begin() ;
map<long, vector<long> >::iterator it2;
// the code need compare each key's second elements
for( ; it1 != aMap.end(); it1++ ) {
it2 = it1;
it2++;
// Parallel comparison: THE MOST TIME CONSUMING PART
for( ; it2 != aMap.end(); it2++ ) {
unsigned long i = 0, j = 0, _union = 0, _inter = 0 ;
while( i < it1->second.size() && j < it2->second.size() ) {
if( it1->second[i] < it2->second[j] ) {
i++;
} else if( it1->second[i] > it2->second[j] ) {
j++;
} else {
i++; j++; _inter++;
}
}
_union = it1->second.size() + it2->second.size() - _inter;
if ( (double) _inter / _union > THRESH )
cout << it1->first << " might appears frequently with " << it2->first << endl;
}
}
(This is not a complete answer. It only solves part of your problem; the part about bit manipulation.)
Here's a class you might be able to use to calculate the number of intersections between two sets (the cardinality of the intersection) rather quickly.
It uses a bit vector to store the sets, which means the universe of the possible set members must be small.
#include <cassert>
#include <vector>
class BitVector
{
// IMPORTANT: U must be unsigned
// IMPORTANT: use unsigned long long in 64-bit builds.
typedef unsigned long U;
static const unsigned UBits = 8 * sizeof(U);
public:
BitVector (unsigned size)
: m_bits ((size + UBits - 1) / UBits, 0)
, m_size (size)
{
}
void set (unsigned bit)
{
assert (bit < m_size);
m_bits[bit / UBits] |= (U)1 << (bit % UBits);
}
void clear (unsigned bit)
{
assert (bit < m_size);
m_bits[bit / UBits] &= ~((U)1 << (bit % UBits));
}
unsigned countIntersect (BitVector const & that) const
{
assert (m_size == that.m_size);
unsigned ret = 0;
for (std::vector<U>::const_iterator i = m_bits.cbegin(),
j = that.m_bits.cbegin(), e = m_bits.cend(), f = that.m_bits.cend();
i != e && j != f; ++i, ++j)
{
U x = *i & *j;
// Count the number of 1 bits in x and add it to ret
// There are much better ways than this,
// e.g. using the POPCNT instruction or intrinsic
while (x != 0)
{
ret += x & 1;
x >>= 1;
}
}
return ret;
}
unsigned countUnion (BitVector const & that) const
{
assert (m_size == that.m_size);
unsigned ret = 0;
for (std::vector<U>::const_iterator i = m_bits.cbegin(),
j = that.m_bits.cbegin(), e = m_bits.cend(), f = that.m_bits.cend();
i != e && j != f; ++i, ++j)
{
U x = *i | *j;
while (x != 0)
{
ret += x & 1;
x >>= 1;
}
}
return ret;
}
private:
std::vector<U> m_bits;
unsigned m_size;
};
And here's a very small test program to see how you can use the above class. It makes two sets (each with 100K maximum elements), adds some items to them (using the set() member function) and then calculates their intersection 1 billion times. It runs in under two seconds on my machine.
#include <iostream>
using namespace std;
int main ()
{
unsigned const SetSize = 100000;
BitVector a (SetSize), b (SetSize);
for (unsigned i = 0; i < SetSize; i += 2) a.set (i);
for (unsigned i = 0; i < SetSize; i += 3) b.set (i);
unsigned x = a.countIntersect (b);
cout << x << endl;
return 0;
}
Don't forget to compile this with optimizations enabled! Otherwise it performs very badly.
POPCNT
Modern processors have an instruction to count the number of set bits in a word, called POPCNT. This is quite a lot faster than doing the naive thing written above (as a side note, there are faster ways to do it in software as well, but I didn't want to pollute the code).
Anyway, the way to use POPCNT in C/C++ code is to use a compiler intrinsic or built-in. In MSVC, you can use the __popcnt() intrinsic, which works on 32-bit integers (__popcnt64() handles 64-bit ones on x64). In GCC, you can use __builtin_popcount() for unsigned int, __builtin_popcountl() for unsigned long and __builtin_popcountll() for unsigned long long. Be warned that these functions may not be available, depending on your compiler version, target architecture and/or compile switches.
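As a sketch of what the counting loops look like with a built-in, the functions below work on sets stored as packed 64-bit words. The names count_intersect, count_union and set_bit are hypothetical and stand apart from the BitVector class, and __builtin_popcountll is assumed to be available (GCC/Clang). Whether it compiles to a single POPCNT instruction depends on the target flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sets are stored as packed 64-bit words; both vectors must have equal length.
std::size_t count_intersect(const std::vector<std::uint64_t>& a,
                            const std::vector<std::uint64_t>& b) {
    assert(a.size() == b.size());
    std::size_t total = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        total += static_cast<std::size_t>(__builtin_popcountll(a[i] & b[i]));
    return total;
}

std::size_t count_union(const std::vector<std::uint64_t>& a,
                        const std::vector<std::uint64_t>& b) {
    assert(a.size() == b.size());
    std::size_t total = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        total += static_cast<std::size_t>(__builtin_popcountll(a[i] | b[i]));
    return total;
}

// Hypothetical helper for filling a set.
void set_bit(std::vector<std::uint64_t>& words, std::size_t bit) {
    words[bit / 64] |= std::uint64_t(1) << (bit % 64);
}
```

The ratio count_intersect(a, b) / count_union(a, b) is the quantity the question compares against THRESH.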
Maybe you would like to try PPL or one of its analogues. I don't really understand what your code is supposed to do, as it doesn't seem to have any output. But side-effect-free code can be parallelized effectively with tools like the Parallel Patterns Library.
I have a bit-mask of N chars in size, which is statically known (i.e. can be calculated at compile time, but it's not a single constant, so I can't just write it down), with bits set to 1 denoting the "wanted" bits. And I have a value of the same size, which is only known at runtime. I want to collect the "wanted" bits from that value, in order, into the beginning of a new value. For simplicity's sake let's assume the number of wanted bits is <= 32.
Completely unoptimized reference code which hopefully has the correct behaviour:
template<int N, const char mask[N]>
unsigned gather_bits(const char* val)
{
unsigned result = 0;
char* result_p = (char*)&result;
int pos = 0;
for (int i = 0; i < N * CHAR_BIT; i++)
{
if (mask[i/CHAR_BIT] & (1 << (i % CHAR_BIT)))
{
if (val[i/CHAR_BIT] & (1 << (i % CHAR_BIT)))
{
if (pos < sizeof(unsigned) * CHAR_BIT)
{
result_p[pos/CHAR_BIT] |= 1 << (pos % CHAR_BIT);
}
else
{
abort();
}
}
pos += 1;
}
}
return result;
}
Although I'm not sure whether that formulation actually allows access to the contents of the mask at compile time. But in any case, it's available for use, maybe a constexpr function or something would be a better idea. I'm not looking here for the necessary C++ wizardry (I'll figure that out), just the algorithm.
An example of input/output, with 16-bit values and imaginary binary notation for clarity:
mask = 0b0011011100100110
val = 0b0101000101110011
--
wanted = 0b__01_001__1__01_ // retain only those bits which are set in the mask
result = 0b0000000001001101 // bring them to the front
^ gathered bits begin here
My questions are:
What's the most performant way to do this? (Are there any hardware instructions that can help?)
What if both the mask and the value are restricted to be unsigned, so a single word, instead of an unbounded char array? Can it then be done with a fixed, short sequence of instructions?
There will be a pext (parallel bit extract) instruction in Intel Haswell that does exactly what you want. I don't know what the performance of that instruction will be, though it is probably better than the alternatives. This operation is also known as "compress-right" or simply "compress"; the implementation from Hacker's Delight is this:
unsigned compress(unsigned x, unsigned m) {
unsigned mk, mp, mv, t;
int i;
x = x & m; // Clear irrelevant bits.
mk = ~m << 1; // We will count 0's to right.
for (i = 0; i < 5; i++) {
mp = mk ^ (mk << 1); // Parallel prefix.
mp = mp ^ (mp << 2);
mp = mp ^ (mp << 4);
mp = mp ^ (mp << 8);
mp = mp ^ (mp << 16);
mv = mp & m; // Bits to move.
m = m ^ mv | (mv >> (1 << i)); // Compress m.
t = x & mv;
x = x ^ t | (t >> (1 << i)); // Compress x.
mk = mk & ~mp;
}
return x;
}
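For checking either the routine above or a hardware implementation, a straightforward single-word reference is handy. The sketch below is such a reference, with the hypothetical name compress_ref. It loops once per set mask bit, so it is slower than the branch-free version, and it assumes a 32-bit unsigned. On CPUs with BMI2, the instruction is exposed as the _pext_u32 intrinsic in <immintrin.h>, which the sketch deliberately does not call so that it builds without special flags.

```cpp
#include <cassert>

// Reference "compress": gathers the bits of x selected by m into the low
// bits of the result, preserving their order.
unsigned compress_ref(unsigned x, unsigned m) {
    unsigned result = 0;
    unsigned out_bit = 1;                // next free position in the result
    while (m != 0) {
        unsigned lowest = m & (~m + 1u); // lowest remaining mask bit
        if (x & lowest) result |= out_bit;
        out_bit <<= 1;
        m &= m - 1;                      // drop that mask bit
    }
    return result;
}
```

With the question's example (mask 0b0011011100100110, value 0b0101000101110011), compress_ref(0x5173, 0x3726) gives 0x4D, i.e. 0b01001101.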